
Improved Parallelization for Training and Inference #129

Closed
robbymeals wants to merge 37 commits

Conversation

robbymeals

robbymeals commented Jan 8, 2015

EDIT: Adding a TL;DR to this original comment to make it easier to review, since this PR has become something of a monster.

Summary of changes:
Added a mixin class ParallelMixin in a new pystruct.utils.parallel module that does the following:

  • adds a private method _spawn_pool that spawns a pool and stores it as an attribute
  • the pool can be a ThreadPool, MemmapingPool or Pool, depending on the environment and parameters
  • exposes a public method parallel with the same API as the standard library map, taking the function to be mapped and an iterable of arguments
  • __getstate__ and __setstate__ handle the pool attribute
  • the parallel method handles KeyboardInterrupt somewhat gracefully, allowing user interruption that stops training but preserves the model object and pool in a usable state
  • removed all Parallel calls and any other parallelization logic from the rest of the module; it now lives only in utils.parallel
  • added hacky wrappers for all functions that are currently used as mappables; this could potentially change to a decorator, but then usages of the original functions elsewhere would have to be updated
  • other stuff? (see the sketch below)
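For concreteness, here is a minimal, self-contained sketch of what such a mixin could look like. This is not the PR's actual code; the learner parameters it reads (n_jobs, use_threads) and the exact interrupt handling are assumptions made for illustration.

```python
import multiprocessing
from multiprocessing.pool import ThreadPool


class ParallelMixin(object):
    """Illustrative sketch only, not the PR's actual implementation."""

    def _spawn_pool(self):
        # Hypothetical learner parameters; the real mixin may differ.
        n_jobs = getattr(self, "n_jobs", 1)
        processes = None if n_jobs == -1 else n_jobs
        if getattr(self, "use_threads", False):
            self.pool = ThreadPool(processes=processes)
        else:
            # A real implementation could prefer joblib's MemmapingPool
            # here when it is available in the installed ecosystem.
            self.pool = multiprocessing.Pool(processes=processes)

    def parallel(self, function, iterable):
        # Same call signature as the built-in map().
        if getattr(self, "pool", None) is None:
            self._spawn_pool()
        try:
            return self.pool.map(function, iterable)
        except KeyboardInterrupt:
            # On Ctrl-C, swap in a fresh pool so both the model object
            # and self.pool remain usable after the interruption.
            self.pool.terminate()
            self._spawn_pool()
            raise

    def __getstate__(self):
        # Pools cannot be pickled, so drop the attribute when dumping ...
        state = self.__dict__.copy()
        state.pop("pool", None)
        return state

    def __setstate__(self, state):
        # ... and recreate it lazily after loading.
        self.__dict__.update(state)
        self.pool = None
```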

Original comment:

Hi! Using Parallel causes a new pool to be respawned on every iteration, which is slow from the start, causes memory issues, and becomes very slow once the matrices get at all large. @amueller what do you think of this approach? I am sure there are other moving parts that I'm not aware of that need to be considered, but replacing the Parallel calls as in this PR in both OneSlackSSVM and SubgradientSSVM worked really well for me.
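To make the contrast concrete, here is a toy, standalone sketch of the two patterns (nothing pystruct-specific; inference is just a stand-in for the real per-sample work):

```python
from multiprocessing import Pool
from joblib import Parallel, delayed


def inference(sample):
    # Stand-in for the real per-sample inference work.
    return sum(sample)


if __name__ == "__main__":
    X = [[i, i + 1, i + 2] for i in range(1000)]

    # Current pattern: Parallel spawns (and tears down) a fresh worker pool
    # inside every call, i.e. once per training iteration.
    for _ in range(50):
        results = Parallel(n_jobs=4)(delayed(inference)(x) for x in X)

    # Proposed pattern: spawn the pool once and reuse it, so the workers
    # (and anything copied into them) persist across iterations.
    pool = Pool(processes=4)
    for _ in range(50):
        results = pool.map(inference, X)
    pool.terminate()
```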

@robbymeals
Author

Right, I would have to add a check for the sklearn version to ensure it is available. You could move the creation of self.pool into the SSVM base object. I think even just using a pre-instantiated Pool would work; you just have to do more of the annoying error handling and such that is required when you use multiprocessing directly.

You could also conditionally instantiate and terminate self.pool in the fit method.
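A rough sketch of that variant, with made-up names standing in for the learner's internals (illustrative only, not pystruct's actual fit code):

```python
from multiprocessing import Pool


def _inference(sample):
    # Stand-in for pystruct's per-sample loss-augmented inference.
    return sum(sample)


class LearnerSketch(object):
    """Illustrative only: conditionally create and tear down the pool in fit()."""

    def __init__(self, n_jobs=1, max_iter=10):
        self.n_jobs = n_jobs
        self.max_iter = max_iter
        self.pool = None

    def fit(self, X):
        if self.n_jobs != 1:
            self.pool = Pool(None if self.n_jobs == -1 else self.n_jobs)
        try:
            for _ in range(self.max_iter):
                mapper = self.pool.map if self.pool is not None else map
                results = list(mapper(_inference, X))
                # ... update parameters from results, check convergence ...
        finally:
            # Terminate only the pool created for this fit() call.
            if self.pool is not None:
                self.pool.terminate()
                self.pool = None
        return self


if __name__ == "__main__":
    LearnerSketch(n_jobs=2).fit([[1, 2], [3, 4]])
```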

@amueller
Member

amueller commented Jan 2, 2015

Thanks for the PR.
I am a bit surprised that this has an impact on performance. Could you give some numbers / benchmarks?

@robbymeals
Author

Hey, sure! Currently I am playing around with this on an r3.4xlarge EC2 box that I have running for other tasks, so 16 vCPUs and 122 GB of RAM; not everyday capacity. I can't really share the data that I am using, but it is text data and I am currently just looking at the MultilabelClf and GridCRF models. I will mock up a more detailed example to show you what I'm talking about on data that looks more similar to mine. But just using your multi_label.py example and adding n_jobs=-1 to the learner instantiations, here are the benchmarks I get for the scene dataset:

Running first with my current master branch, using the vanilla Pool rather than MemmapingPool:

(nlp-venv)nlp@ip-10-153-157-196:~/benchmarks$ time python multi_label.py
fitting independent model...
fitting full model...
fitting tree model...
Training loss independent model: 0.066612
Test loss independent model: 0.111204
Training loss tree model: 0.059868
Test loss tree model: 0.106048
Training loss full model: 0.049408
Test loss full model: 0.097408

real    0m58.619s
user    1m41.180s
sys     0m5.705s

and then with your master, using Parallel:

(nlp-venv)nlp@ip-10-153-157-196:~/benchmarks$ time python multi_label.py
fitting independent model...

fitting full model...
fitting tree model...
Training loss independent model: 0.066612
Test loss independent model: 0.111204
Training loss tree model: 0.059868
Test loss tree model: 0.106048
Training loss full model: 0.049408
Test loss full model: 0.097408

real    4m3.351s
user    5m49.086s
sys     1m17.763s

I'll run a couple other examples to get similar comparisons for other datasets and models.
Sorry for all the commits, this PR is just meant to illustrate the approach.

@amueller
Member

amueller commented Jan 2, 2015

Sorry, did you mean n_jobs=1 or n_jobs=-1? With n_jobs=1 there is hopefully no difference.

@robbymeals
Author

Yes, n_jobs=-1, sorry.

@robbymeals
Author

These are the changes to the multi_label.py example:

full_ssvm = OneSlackSSVM(full_model, inference_cache=50, C=.1, tol=0.01, n_jobs=-1)

tree_ssvm = OneSlackSSVM(tree_model, inference_cache=50, C=.1, tol=0.01, n_jobs=-1)

independent_ssvm = OneSlackSSVM(independent_model, C=.1, tol=0.01, n_jobs=-1)

@amueller
Member

amueller commented Jan 2, 2015

That looks pretty promising. I want to run a couple more tests on the larger graphs (or you can, if you like). I'd like to understand what is happening a bit better; maybe @ogrisel can help. I did not think Parallel would have such an overhead over Pool. Maybe it is because the models are so small and inference is so quick?

@robbymeals
Author

No, it's even worse on larger models. The models I'm playing around with are much larger than these, and the problem gets worse with larger input matrices and actually worse with each iteration. I have run into this before in other contexts. I think it is just because Parallel reinstantiates the pool with every call, so everything has to get copied to the subprocesses again on each loop. Or whatever the correct way of saying that actually is. :)

Yeah I'd be happy to run them, did you have any specific examples in mind?

Btw, here are the benchmarks from multi_label.py for the yeast dataset:

My master, using Pool:

(nlp-venv)nlp@ip-10-153-157-196:~/benchmarks$ time python multi_label.py
fitting independent model...
fitting full model...
fitting tree model...
Training loss independent model: 0.191571
Test loss independent model: 0.199330
Training loss tree model: 0.191333
Test loss tree model: 0.200499
Training loss full model: 0.191095
Test loss full model: 0.199797

real    3m42.851s
user    6m15.402s
sys     0m8.750s

and current master using Parallel:

(nlp-venv)nlp@ip-10-153-157-196:~/benchmarks$ time python multi_label.py
fitting independent model...
fitting full model...
fitting tree model...
Training loss independent model: 0.191571
Test loss independent model: 0.199330
Training loss tree model: 0.191333
Test loss tree model: 0.200499
Training loss full model: 0.191095
Test loss full model: 0.199797

real    9m11.171s
user    13m14.096s
sys     2m13.286s

Sidenote: kudos on this package, it's very exciting!

@amueller
Member

amueller commented Jan 2, 2015

Thanks. I hope to improve it a lot in the near future, with respect to documentation, flexibility and sparse matrix support.
The snake example would be interesting, and the image segmentation example, too.

@robbymeals
Author

Ok those two are running now. Taking the approach I have here, with self.pool, I'd have to use https://docs.python.org/2/library/copy_reg.html to register a pickler that is aware of the pool and ignores it when dumping and recreates it if necessary when loading. I've done that before for very similar use cases, it's not that painful. I think that's why the checks are failing?
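For reference, a minimal sketch of the copy_reg/copyreg approach described above, using a hypothetical stand-in class rather than the actual SSVM learners:

```python
import copyreg   # copy_reg on Python 2
import pickle


class LearnerSketch(object):
    """Stand-in for a learner that keeps a non-picklable pool attribute."""

    def __init__(self, n_jobs=1):
        self.n_jobs = n_jobs
        self.pool = None   # would hold a multiprocessing pool during fit


def _rebuild_learner(state):
    obj = LearnerSketch.__new__(LearnerSketch)
    obj.__dict__.update(state)
    obj.pool = None        # recreated lazily the next time it is needed
    return obj


def _reduce_learner(obj):
    state = dict(obj.__dict__)
    state.pop("pool", None)            # ignore the pool when dumping
    return _rebuild_learner, (state,)


# Register the custom pickler for the learner class.
copyreg.pickle(LearnerSketch, _reduce_learner)

# Round-trip works; a live pool attribute would simply be stripped on dump.
restored = pickle.loads(pickle.dumps(LearnerSketch(n_jobs=4)))
```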

@robbymeals
Author

Not seeing as much of a difference in these examples:

image segmentation benchmarks using Pool:

...
real    16m47.309s
user    28m43.323s
sys     0m49.150s
(nlp-venv)nlp@ip-10-153-157-196:~/benchmarks$

and using Parallel:

...
real    16m49.758s
user    29m11.565s
sys     0m49.593s

and plot_snakes.py, using Pool:

(nlp-venv)nlp@ip-10-153-157-196:~/benchmarks$ time python plot_snakes.py
Please be patient. Learning will take 5-20 minutes.
Results using only directional features for edges
Test accuracy: 0.829
[[2750    0    0    0    0    0    0    0    0    0    0]
 [   0   98    0    0    1    0    0    0    1    0    0]
 [   0    6   38    3   34    8    1    2    5    1    2]
 [   0    9    8   10    8   41    1   12    3    7    1]
 [   0    1   14    2   37    8    1    9   21    5    2]
 [   0    4    2    9   16   29    2   19   11    7    1]
 [   0    2   13    3   30   16    2    7   20    5    2]
 [   0    7    5    8   15   29    3   14    8   11    0]
 [   0    3   10    3   29   10    1    6   20    3   15]
 [   0    9    3    2   10    8    0   15    4   46    3]
 [   0    2    7    3    9    1    1    3    7    3   64]]
Results using also input features for edges
Test accuracy: 0.998
[[2749    0    0    0    0    0    0    0    1    0    0]
 [   0  100    0    0    0    0    0    0    0    0    0]
 [   0    0  100    0    0    0    0    0    0    0    0]
 [   0    0    0  100    0    0    0    0    0    0    0]
 [   0    0    0    0   98    0    2    0    0    0    0]
 [   0    0    0    0    0   99    0    1    0    0    0]
 [   0    0    0    0    0    0  100    0    0    0    0]
 [   0    0    0    0    0    1    0   99    0    0    0]
 [   0    0    0    0    0    0    0    0  100    0    0]
 [   0    0    0    0    0    0    0    1    0   99    0]
 [   0    0    0    0    0    0    0    0    0    0  100]]

real    5m5.135s
user    21m33.105s
sys     0m5.142s

and using Parallel:

(nlp-venv)nlp@ip-10-153-157-196:~/benchmarks$ time python plot_snakes.py
Please be patient. Learning will take 5-20 minutes.
Results using only directional features for edges
Test accuracy: 0.829
[[2750    0    0    0    0    0    0    0    0    0    0]
 [   0   98    0    0    1    0    0    0    1    0    0]
 [   0    6   38    3   34    8    1    2    5    1    2]
 [   0    9    8   10    8   41    1   12    3    7    1]
 [   0    1   14    2   37    8    1    9   21    5    2]
 [   0    4    2    9   16   29    2   19   11    7    1]
 [   0    2   13    3   30   16    2    7   20    5    2]
 [   0    7    5    8   15   29    3   14    8   11    0]
 [   0    3   10    3   29   10    1    6   20    3   15]
 [   0    9    3    2   10    8    0   15    4   46    3]
 [   0    2    7    3    9    1    1    3    7    3   64]]
Results using also input features for edges
Test accuracy: 0.998
[[2749    0    0    0    0    0    0    0    1    0    0]
 [   0  100    0    0    0    0    0    0    0    0    0]
 [   0    0  100    0    0    0    0    0    0    0    0]
 [   0    0    0  100    0    0    0    0    0    0    0]
 [   0    0    0    0   98    0    2    0    0    0    0]
 [   0    0    0    0    0   99    0    1    0    0    0]
 [   0    0    0    0    0    0  100    0    0    0    0]
 [   0    0    0    0    0    1    0   99    0    0    0]
 [   0    0    0    0    0    0    0    0  100    0    0]
 [   0    0    0    0    0    0    0    1    0   99    0]
 [   0    0    0    0    0    0    0    0    0    0  100]]

real    5m6.173s
user    21m33.023s
sys     0m5.185s

@robbymeals
Author

Also, following up on the sidenote: I have done some work toward sparse matrix support. Have you already made a lot of progress on that? If so, maybe a feature branch would make sense, so no one reinvents your wheels.

@ogrisel

ogrisel commented Jan 3, 2015

I want to run a couple more tests on the larger graphs (or you can, if you like). I'd like to understand what is happening a bit better. maybe @ogrisel can help. I did not think Parallel would have such an overhead over Pool. Maybe it is because the models are so small and inference is so quick?

Yes, I would like to experiment with a version of joblib that would be able to reuse an existing instance of a pool of worker processes. Typically I would like joblib to create a single pool with n_cpus workers by default, use a subset of those workers when n_jobs < n_cpus, and use all the workers when n_jobs == -1. But it's tedious to sub-slice an existing pool with the multiprocessing API. It was not designed to do this...

@ogrisel

ogrisel commented Jan 3, 2015

Ok those two are running now. Taking the approach I have here, with self.pool, I'd have to use https://docs.python.org/2/library/copy_reg.html to register a pickler that is aware of the pool and ignores it when dumping and recreates it if necessary when loading. I've done that before for very similar use cases, it's not that painful. I think that's why the checks are failing?

The default multiprocessing Pool is very restricted. The MemmapingPool of joblib makes it possible to:

  • register custom picklers,
  • avoid memory copy when the input data is already a np.memmap instance,
  • automatically memory map large numpy arrays to limit the number of memory copies when the same array is used in many concurrent workers.
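A small usage sketch, assuming a joblib of that era where the class lives in joblib.pool (bundled as sklearn.externals.joblib.pool in scikit-learn); the constructor arguments shown are assumptions based on the commonly documented ones:

```python
import numpy as np
from joblib.pool import MemmapingPool   # sklearn.externals.joblib.pool in bundled versions


def column_mean(args):
    X, j = args
    return float(X[:, j].mean())


if __name__ == "__main__":
    X = np.random.rand(20000, 50)        # ~8 MB, well above max_nbytes

    # Arrays bigger than max_nbytes are dumped to a temporary memmap and
    # shared with the workers instead of being pickled into each of them.
    pool = MemmapingPool(processes=4, max_nbytes=1e6)
    try:
        means = pool.map(column_mean, [(X, j) for j in range(X.shape[1])])
    finally:
        pool.terminate()
```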

@robbymeals
Author

@ogrisel Yes, my intention definitely would be to use MemmapingPool in any real implementation; I have reverted to Pool only because one of the Travis checks uses a version of scikit-learn without the sklearn.externals.joblib.pool submodule. I was talking about pickling the model object, as in the logger functionality, which would require registering a function that strips out the self.pool attribute in some smart way on dump and reinstantiates it on load, I think.

@robbymeals
Author

Yes, I would like to experiment with a version of joblib that would be able to reuse an existing instance of a pool of worker processes. Typically I would like joblib to create a single pool with n_cpus workers by default, use a subset of those workers when n_jobs < n_cpus, and use all the workers when n_jobs == -1. But it's tedious to sub-slice an existing pool with the multiprocessing API. It was not designed to do this...

@ogrisel How were you thinking of referencing the pool, namespace-wise? I have had difficulty figuring out a way to do this in other applications that didn't involve ugly global hacks, if that makes sense, but I may be missing some key piece.

@robbymeals
Author

@amueller So if I did some further work to implement this approach more fully, would you consider including it in the project, obviously after review and approval and whatever else? Or would you want to build it out yourself?

@robbymeals
Author

If it does interest you, I'm happy to take marching orders before undertaking it. I will probably do some version of it for use internally at my job, but I would rather do it in a way that can be pushed back into the project and conforms to your standards and practices.

@robbymeals
Author

I enabled use of the learner's pool for inference as well, shaved a few more seconds off the multilabel examples, and got some noticeable improvement on the other examples; rerunning those now. Still failing four tests in the Crammer-Singer test file, and I can't figure out why.

multilabel, scenes dataset:

(nlp-venv)nlp@ip-10-153-157-196:~/benchmarks$ time python multi_label.py
fitting independent model...
fitting full model...
fitting tree model...
Training loss independent model: 0.066612
Test loss independent model: 0.111204
Training loss tree model: 0.059868
Test loss tree model: 0.106048
Training loss full model: 0.049408
Test loss full model: 0.097408

real    0m51.503s
user    1m30.192s
sys     0m2.084s

multilabel, yeast dataset:

(nlp-venv)nlp@ip-10-153-157-196:~/benchmarks$ time python multi_label.py
fitting independent model...
fitting full model...
fitting tree model...
Training loss independent model: 0.191571
Test loss independent model: 0.199330
Training loss tree model: 0.191333
Test loss tree model: 0.200499
Training loss full model: 0.191095
Test loss full model: 0.199797

real    3m38.756s
user    6m13.319s
sys     0m3.004s

@robbymeals
Author

(nlp-venv)nlp@ip-10-153-157-196:~/benchmarks$ time python plot_snakes.py
Please be patient. Learning will take 5-20 minutes.
Results using only directional features for edges
Test accuracy: 0.829
[[2750    0    0    0    0    0    0    0    0    0    0]
 [   0   98    0    0    1    0    0    0    1    0    0]
 [   0    6   38    3   34    8    1    2    5    1    2]
 [   0    9    8   10    8   41    1   12    3    7    1]
 [   0    1   14    2   37    8    1    9   21    5    2]
 [   0    4    2    9   16   29    2   19   11    7    1]
 [   0    2   13    3   30   16    2    7   20    5    2]
 [   0    7    5    8   15   29    3   14    8   11    0]
 [   0    3   10    3   29   10    1    6   20    3   15]
 [   0    9    3    2   10    8    0   15    4   46    3]
 [   0    2    7    3    9    1    1    3    7    3   64]]
Results using also input features for edges
Test accuracy: 0.998
[[2749    0    0    0    0    0    0    0    1    0    0]
 [   0  100    0    0    0    0    0    0    0    0    0]
 [   0    0  100    0    0    0    0    0    0    0    0]
 [   0    0    0  100    0    0    0    0    0    0    0]
 [   0    0    0    0   98    0    2    0    0    0    0]
 [   0    0    0    0    0   99    0    1    0    0    0]
 [   0    0    0    0    0    0  100    0    0    0    0]
 [   0    0    0    0    0    1    0   99    0    0    0]
 [   0    0    0    0    0    0    0    0  100    0    0]
 [   0    0    0    0    0    0    0    1    0   99    0]
 [   0    0    0    0    0    0    0    0    0    0  100]]

real    4m54.910s
user    21m15.152s
sys     0m4.272s

@robbymeals
Author

@amueller I still can't figure out why those four Crammer-Singer tests are failing. Probably something simple I just don't know about.

@robbymeals
Author

multi_label.py benchmarks, for n_jobs=-1 runs with

  1. default OneSlackSSVM config (using MemmapingPool),
  2. use_memmapping_pool=0, and
  3. use_threads=1,

respectively:

(nlp-venv)nlp@ip-10-234-187-58:~/benchmarks$ time python multi_label.py
fitting independent model...
fitting full model...
fitting tree model...
Training loss independent model: 0.066612
Test loss independent model: 0.111204
Training loss tree model: 0.059868
Test loss tree model: 0.106048
Training loss full model: 0.049408
Test loss full model: 0.097408

real    1m32.018s
user    2m1.699s
sys     0m13.477s
(nlp-venv)nlp@ip-10-234-187-58:~/benchmarks$ time python multi_label.py
fitting independent model...
fitting full model...
fitting tree model...
Training loss independent model: 0.066612
Test loss independent model: 0.111204
Training loss tree model: 0.059868
Test loss tree model: 0.106048
Training loss full model: 0.049408
Test loss full model: 0.097408

real    0m49.551s
user    1m29.923s
sys     0m2.540s
(nlp-venv)nlp@ip-10-234-187-58:~/benchmarks$ time python multi_label.py
fitting independent model...
fitting full model...
fitting tree model...
Training loss independent model: 0.066612
Test loss independent model: 0.111204
Training loss tree model: 0.059868
Test loss tree model: 0.106048
Training loss full model: 0.049408
Test loss full model: 0.097408

real    1m46.923s
user    2m53.827s
sys     0m16.664s
(nlp-venv)nlp@ip-10-234-187-58:~/benchmarks$ 

@robbymeals
Author

After some extensive testing, I'm actually leaning towards having the vanilla Pool be the default, since it is so much faster, and making MemmapingPool an option. But I'll stop making changes here until review.

@amueller
Member

amueller commented Feb 6, 2015

Hey. Just wanted to say this is not forgotten ;) I'll try to work on it next week. I had a bunch of scikit-learn stuff to do as we want to release soon.
Thank you so much for your extensive testing.

Btw, have you looked into sparse matrix support any more?

@robbymeals
Author

No worries, I follow the scikit-learn repo, so I figured that was it, given the crazy number of times you've popped up in the newsfeed. :)

RE: sparse matrix support, I think I am missing some fundamental pieces of how to get there from here, honestly. The stuff I came up with is pretty janky and probably not a good starting point. I have been using LSA (TruncatedSVD) in my prototype, to temporarily get around the sparse support issue.

I am definitely willing and able to work on it but I think I would need a bit of guidance from you to make real progress. If you have a bit of time, maybe you could do a high level brain dump of how you envision the best way to implement it and I could work off of that.

@amueller
Member

amueller commented Feb 6, 2015

If you don't have a working prototype, I think I'll just try and work it out. I wouldn't want to point in a wrong direction.

@robbymeals
Author

Yes, I think that's probably best, if you have the time.

@robbymeals
Author

Hey, so I am trying to get my master (with the parallelization changes) up to date with your changes, and I am failing some tests in the docs and in Python 3. I am currently using my fork as the dependency for stuff I am working on. I'm totally willing to do the work to make those tests pass, but I didn't want to waste time if you think my implementation is a nonstarter. Again, no worries if so; I am actually reusing the parallel utils I created here in other stuff, so it wasn't a waste of time on my end regardless :).

@amueller
Member

I'm sorry, I still haven't looked at your contribution.
I think the changes are good, but I had too much else to work on.
Did you sync with my master? And are there doctests in the user guide failing?
I guess I have to check with Travis; I didn't try the doctests on Python 3.


@amueller
Member

It looks like the doctest failures are actually caused by your changes. Feel free to fix them, but it is not that urgent. I hope I have time to look at your changes soon, but I will probably add sparse matrix support before that.

Cheers,
Andy


@robbymeals
Author

Sorry, yeah, my comment wasn't all that clear.
Right, I was syncing with your master, and there were a few failing tests that seemed to be caused by the new params added in my changes showing up in the docs. The Python 3 failures are different; those are core tests failing for some reason that I need to dig into. Basically I'm just making sure you weren't trying to politely decline my PR before I put the work in to fix tests that don't directly affect the work I am using the package for :) No worries on time, I get it.

@amueller
Member

I think your approach is good, and it would be great if you could try to stay in sync with master. I hope there will be a couple of new features arriving soon.


@robbymeals
Author

OK, will do. I actually have made a bit more progress on sparse matrix support; it is still pretty hacky, but I will put it up in a feature branch if you want to take a look.

@amueller
Member

amueller commented May 3, 2015

Definitely :) please submit a PR. It is on my to-do list for the next one or two weeks.

Cheers,
Andy

@amueller
Member

It would be interesting to bench your approach against the improved joblib here:
joblib/joblib#157 (comment)

It still allocates many more pools than necessary, so it might not be as good.

@amueller
Member

Also, pystruct has not gotten enough attention lately, sorry :-/

@amueller
Member

The branch I mentioned above was merged into scikit-learn. Can you check your patch against current pystruct with the dev branch of scikit-learn, please?

@ogrisel

ogrisel commented Jul 16, 2015

Also, there is a new context manager API in the works at joblib/joblib#221. Feel free to test that branch with the monkey-patch mentioned in the first comment.
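For reference, the context manager API that eventually shipped in joblib lets one set of workers be reused across several calls, roughly like this:

```python
from math import sqrt
from joblib import Parallel, delayed

# Reuse one set of workers for several consecutive calls instead of
# spawning a fresh pool for each one.
with Parallel(n_jobs=4) as parallel:
    totals = []
    for iteration in range(5):          # e.g. one call per training iteration
        results = parallel(delayed(sqrt)(i ** 2) for i in range(100))
        totals.append(sum(results))
print(totals)
```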

@robbymeals
Author

I'll try and do this by the end of this weekend. Bit swamped at the moment.

@robbymeals
Author

Hey, a long time later: is this still an optimization you're interested in pursuing? If you've come up with a better solution that I missed, or this isn't the direction you want to go, I'm happy to close this PR.

@robbymeals
Author

Oh, I see that I said I would try it against the latest joblib. I can do that if you haven't, if this is still something worth looking into.

@amueller
Member

amueller commented May 2, 2016

Hey. Yeah, so the latest joblib can reuse pools of workers, so that might be a simpler solution than yours. I haven't benchmarked it, and I think it would be worth trying. I'd love to have the better of the two in ;)

@robbymeals closed this Feb 7, 2017