Skip to content
This repository has been archived by the owner on Dec 6, 2023. It is now read-only.

[MRG] Parallelize OvR method in primal_cd #13

Merged
merged 3 commits into master from the ovr branch on Nov 18, 2014

Conversation

MechCoder
Contributor

@mblondel I was trying to get some speed gains by parallelizing the OvR method. However, when I set n_jobs > 1 it keeps failing with this error: TypeError: __cinit__() takes exactly 1 positional argument (0 given). Note that it works as expected for n_jobs=1.

@mblondel
Member

I think the function to parallelize must be pure (no side effects).

@MechCoder
Contributor Author

@mblondel Do you mean that somehow X gets deleted between __init__ and __cinit__? Or is it because we are modifying coef_ in place across all processes? Or is it something else?

@mblondel
Member

Because coef_ is modified in place.

@mblondel
Member

I'd love to release the GIL in the CD code in lightning. This would allow using threading, as is done for random forests in scikit-learn.

@mblondel
Member

I think the function to parallelize must be pure (no side effects).

Please confirm with Gael.

@MechCoder
Contributor Author

If the GIL is disabled, it will allow shared memory, right? Then we would not run into such problems?
Also, the error message is highly non-intuitive to me.

@mblondel
Member

Can you make your contributions to lightning first? We can discuss merging to scikit-learn eventually.

@MechCoder
Contributor Author

@ogrisel @GaelVaroquaux Can you please have a quick look?

@MechCoder changed the title from "Parallelize OvR method in primal_cd" to "[WIP] Parallelize OvR method in primal_cd" on Sep 18, 2014
@ogrisel

ogrisel commented Sep 18, 2014

The TypeError message is strange, but I am not familiar with this code.

If the GIL is disabled, it will allow shared memory, right? Then we would not run into such problems?

The GIL is never "disabled". When the GIL is released, it means that you can use threading efficiently on multiple CPUs (Parallel with backend='threading') instead of the default multiprocessing. And yes, the main difference between threading and multiprocessing is that memory is shared by default between the threads of a process, while it is not shared by default between different processes.

Releasing the GIL will not change anything if you don't use multiple threads.
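To make the distinction concrete, here is a minimal sketch of how the two joblib backends are selected (not lightning code; math.sqrt is just a stand-in for the per-class work):

```python
from joblib import Parallel, delayed
import math

# Default backend: each call runs in a separate worker process, so the
# arguments are pickled and nothing is shared with the parent process.
out = Parallel(n_jobs=2)(delayed(math.sqrt)(i) for i in range(8))

# Threading backend: the calls run as threads of the current process and see
# the same memory, but this only pays off when the work releases the GIL
# (e.g. nogil Cython loops or NumPy/BLAS calls).
out = Parallel(n_jobs=2, backend="threading")(delayed(math.sqrt)(i) for i in range(8))
```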

@MechCoder
Contributor Author

@ogrisel Thanks for your explanation (as usual).

My question was: when we use a multiprocessing backend, we should make sure that the data shared among all processes is not modified in place, right? For instance, here coef is modified across all processes, so I wanted to clarify whether this could be the source of the error.

@ogrisel

ogrisel commented Sep 19, 2014

As I said, when you use the multiprocessing backend of joblib (the default backend), nothing is shared, unless you use memory mapping for the input data (usually read-only, so not a problem) or for some fitted parameters such as coef. It's better to use the threading backend to concurrently update fitted parameters. Concurrent writes on memory-mapped data are likely to cause platform-specific technical issues (although I don't have much experience with that).

Having concurrent parameter updates can break theoretical convergence. It depends on the algorithm. For coordinate descent you typically have many more coordinates than CPUs. If you select the coordinate to update randomly, the likelihood of two threads or processes updating the same coordinate concurrently is very low, so it should not cause many problems in practice.

Hogwild!, which runs concurrent SGD (not CD) on very sparse data, demonstrates that in practice this can be very beneficial. My intuition is that lock-free concurrent stochastic coordinate descent should also work very well, as long as the number of features is significantly larger than the number of threads.
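As a toy sketch of that intuition (not lightning code; in pure Python the GIL still serializes the bytecode, so this only illustrates the memory-sharing structure, not an actual speed-up): several threads update coordinates of one shared weight vector without any locking, and collisions are rare because there are far more features than threads.

```python
import threading
import numpy as np

n_features, n_threads, n_updates = 1000, 4, 10000
w = np.zeros(n_features)  # shared parameter vector, updated without locks

def worker(seed):
    rng = np.random.RandomState(seed)
    for _ in range(n_updates):
        j = rng.randint(n_features)    # pick a coordinate at random
        w[j] -= 0.01 * rng.randn()     # unsynchronized, Hogwild-style update

threads = [threading.Thread(target=worker, args=(s,)) for s in range(n_threads)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```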

@ogrisel

ogrisel commented Sep 19, 2014

Also, the speed-up might be harmed by false sharing, but it would require extensive experiments to see how detrimental that is in practice for CD on non-toy problems.

@MechCoder
Contributor Author

Okay, I was just confused because of this comment by Mathieu:

Because coef_ is modified in place.

Having concurrent parameter updates can break theoretical convergence.

Well, here the parameter updates are technically independent across jobs, since it is an OvA implementation, and the update done by each job should be independent of the others.

So the solution now is to release the GIL and use a threading-based backend? But I still do not understand the error with the multiprocessing backend. What could be the possible reason, given that no memory is shared among the processes?

My intuition, based on my work on the cd_fast solver, was that the initialized coefs supplied to each job (and modified in place) are different for each job. But in this code the same coef is supplied to all jobs and modified simultaneously, so I think this would be a problem, right?

@ogrisel

ogrisel commented Sep 19, 2014

What could be the possible reason, given that no memory is shared among the processes?

Again: processes do not share any memory by default. Only threads do. This is the main difference between a thread and a process.

Sorry, I did not read the sentence correctly. I don't know about the Cython error, but it seems completely unrelated to whether memory is shared or not.

@MechCoder
Contributor Author

That's what I meant! :)


@MechCoder
Contributor Author

I see you have edited your comment. Sorry.

@mblondel
Member

Well, I've never seen joblib.Parallel used like that in scikit-learn with the multiprocessing backend. So perhaps the error message is unrelated, but in any case what you've tried won't work, as memory is not shared (and the fact that the code works with n_jobs=1 but not with n_jobs > 1 does seem to indicate that the problem is with the way you parallelize!). What we usually do in scikit-learn is to allocate memory to be written from within the function to be parallelized and return the array (not write it in place). Then we agglomerate the arrays to form a 2d array.

@ogrisel What @MechCoder is doing is embarrassingly parallel.
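As a rough sketch of that pattern (the _fit_binary helper and the data here are hypothetical, not the code in this PR): each job allocates and returns its own coefficient row, and the rows are stacked afterwards, so nothing is written in place across processes.

```python
import numpy as np
from joblib import Parallel, delayed

def _fit_binary(X, y, k):
    # Allocate the row locally, fill it by solving the "class k vs. rest"
    # problem, and return it instead of writing into a shared coef_ array.
    coef_k = np.zeros(X.shape[1])
    # ... run the binary coordinate descent here ...
    return coef_k

X = np.random.rand(30, 4)
y = np.random.randint(0, 3, size=30)

rows = Parallel(n_jobs=2)(delayed(_fit_binary)(X, y, k) for k in np.unique(y))
coef_ = np.vstack(rows)  # agglomerate into the (n_classes, n_features) array
```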

@MechCoder
Contributor Author

What we usually do in scikit-learn is to allocate memory to be written from within the function to be parallelized and return the array

Yes, I know that; I will give it a shot now.

@MechCoder
Contributor Author

I thought "embarassingly" was an adjective, till I googled both together :)

@mblondel
Member

Technically, it's an adverb :)

@MechCoder
Contributor Author

@mblondel I'm sorry that "now" turned out to mean six days later, but I believe I have made the data that each OvR loop writes in place independent of the others. Can you have a quick look? If what I have done is right, then the problem is not with the parallelization but with something deeper.

@MechCoder
Contributor Author

@mblondel This seems to be a bug in joblib; I filed a bug report here: joblib/joblib#169

@MechCoder
Contributor Author

@mblondel The only solution is to use a threading-based backend to overcome this error. WDYT?

@mblondel
Member

I would start by trying to find why the dataset classes don't pickle.

@MechCoder
Contributor Author

This is because of the __cinit__ constructor. I have provided a minimal example in the joblib issue report.
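For context, here is a pure-Python sketch of the picklability requirement (illustrative only; ToyDataset is not the actual dataset class and this is not necessarily the actual fix): joblib's multiprocessing backend pickles every argument, and a type whose constructor requires arguments has to tell pickle how to rebuild an instance, for example via __reduce__.

```python
import pickle
import numpy as np

class ToyDataset(object):
    def __init__(self, X):
        self.X = np.asarray(X)

    def __reduce__(self):
        # Rebuild the object by calling ToyDataset(X) again on unpickling.
        # A Cython extension type with a non-trivial __cinit__ needs an
        # explicit hook like this, because the default reconstruction creates
        # the instance without constructor arguments (hence the
        # "__cinit__() takes exactly 1 positional argument (0 given)" error).
        return (ToyDataset, (self.X,))

ds = pickle.loads(pickle.dumps(ToyDataset(np.eye(3))))
assert np.array_equal(ds.X, np.eye(3))
```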


@mblondel
Member

How can this be fixed?

@MechCoder
Contributor Author

How can this be fixed?

I'm not sure if this question is rhetorical, but I think I figured it out: https://github.com/mblondel/lightning/pull/15

@mblondel
Member

No, that was an actual question ;-)

Return the coefs and errors from primal_cd modified inplace.
@MechCoder changed the title from "[WIP] Parallelize OvR method in primal_cd" to "[MRG] Parallelize OvR method in primal_cd" on Oct 6, 2014
@MechCoder
Contributor Author

@mblondel Finally done with this. All tests pass, so I assume it works. I did not benchmark rigorously, but it does cut the running time of this example http://www.mblondel.org/lightning/auto_examples/document_classification_news20.html#document-classification-news20-py from roughly 10 seconds down to about 5.

@@ -194,7 +200,7 @@ def __init__(self, loss="squared_hinge", penalty="l2", multiclass=False,
warm_debiasing=False,
selection="cyclic", permute=True,
callback=None, n_calls=100,
- random_state=None, verbose=0):
+ random_state=None, verbose=0, n_jobs=-1):
Member

I would prefer n_jobs=1 by default.

Contributor Author

Is there any reason for this?

Contributor Author

Is this because the overhead of parallel computation would be much more than the benefits on small datasets?

Member

  • we don't want to use all cores silently
  • multiprocessing has some issues on Windows
  • for consistency with scikit-learn

Perhaps @ogrisel or @GaelVaroquaux know other more compelling arguments.

@MechCoder
Contributor Author

@mblondel Done. Merge?

@mblondel
Member

mblondel commented Oct 8, 2014

Give me some time. I want to carefully review this.

@MechCoder
Contributor Author

Give me some time.

Sure :)

@MechCoder
Contributor Author

@mblondel Any news on this? ;)

@mblondel merged commit 1239b8f into scikit-learn-contrib:master on Nov 18, 2014
@mblondel
Member

Finally had the time to review and merge. Sorry it took so long! Thank you @MechCoder :)

@MechCoder
Contributor Author

Great, I thought there were merge conflicts. Did you resolve them manually?

@MechCoder deleted the ovr branch on November 18, 2014 at 13:54
@mblondel
Member

The conflicts were in the generated cpp so I just regenerated it.

@MechCoder
Contributor Author

Thanks!

