Cross validation -- option to control # of threads #15

Open
illdopejake opened this issue Nov 5, 2020 · 5 comments

@illdopejake

Hiya Leon et al.,

I ran into an interesting issue when the sys admin of my cluster reached out about some problematic processes that were instantiated when running a parallelized version of the SuStaIn cross-validation. The wrapper script looked something like this:

from pySuStaIn import ZscoreSustain
import multiprocessing as mp

sustain_input = ZscoreSustain(args)
test_idxs = <a list of lists>

jobs = []
NFolds = 10
for fold in range(NFolds):
    # one process per cross-validation fold
    p = mp.Process(target=sustain_input.cross_validate_sustain_model,
                   args=(test_idxs, fold))
    jobs.append(p)
    p.start()

This script was then submitted to the cluster with a .sh script specifying parameters such as the number of nodes and cores (in this case I asked for 1 node and 32 cores). However, the individual jobs were themselves starting several other threads/processes, effectively overriding the specifications in my .sh script. The result was that I asked for 32 cores but ended up with roughly 32^2 threads running on the node, which caused a lot of context switching and inefficient use of the processors.
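
In case it helps anyone hitting the same thing, a quick way to confirm this kind of oversubscription from the wrapper script itself is something like the sketch below (it uses psutil, which is not part of pySuStaIn and would need to be installed separately; run it once the folds have had a moment to spin up):

import psutil

# Inspect the wrapper script's own children while the folds are running
parent = psutil.Process()
children = parent.children(recursive=True)

total_threads = sum(child.num_threads() for child in children)
print(f"{len(children)} worker processes, {total_threads} threads in total")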

I admit this is kind of a niche issue, and maybe folks don't care so much about how efficient the code is. But I think it could be addressed quite easily by adding an argument that lets the user control the internal parallelization to some degree, à la the n_jobs framework in sklearn. As it stands, the parallel behaviour does not seem to be controllable by the user.
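
For reference, the sklearn pattern I mean looks like the snippet below: the caller decides both how many folds run at once and how much parallelism the estimator may use internally (this is plain sklearn, not pySuStaIn):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# n_jobs on the estimator caps its internal parallelism;
# n_jobs on cross_val_score caps how many folds run at once
scores = cross_val_score(RandomForestClassifier(n_jobs=1),
                         X, y, cv=10, n_jobs=4)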

Forgive me if this isn't clear. Would be happy to provide greater detail!

As always, thanks for making such an amazing library!

@sea-shunned
Member

Hi Jake,

In your args, do you have use_parallel_startpoints = True? If so, this itself is doing some multiprocessing in the maximum likelihood part, which would explain the exploding threads! Disabling this (use_parallel_startpoints = False) should keep things consistent for your parallelization across folds. This may cause the individual runs to be slower, but doing 10 (in this example) simultaneously should be faster overall than internal parallelization.
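
For concreteness, a minimal sketch of what I mean: pass the flag when constructing the model, so the only parallelism left is your own per-fold processes. The other constructor arguments here are just placeholders in the style of the pySuStaIn example notebooks and may differ between versions:

sustain_input = ZscoreSustain(data, Z_vals, Z_max, biomarker_labels,
                              N_startpoints, N_S_max, N_iterations_MCMC,
                              output_folder, dataset_name,
                              use_parallel_startpoints=False)  # disable internal start-point parallelism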

If you did set use_parallel_startpoints = False, then...well, I'll have to have a think 😄

A possible route for the future may be, as you say, to allow finer control for each run (e.g. an n_cpus argument) rather than a binary serial/parallel switch, or simply to expose the options of pathos (which is used underneath) more fully, so thank you for bringing this up!

@illdopejake
Author

Hi Cameron,

Thanks so much for looking into this. I appreciate you bringing the use_parallel_startpoints argument to my attention; I hadn't really paid much attention to it, and I'm glad to know about it now. But in the instance where I encountered the behaviour described in this issue, I actually had use_parallel_startpoints set to False. So something else in the code must be spawning all of these processes.

@sea-shunned
Member

That is interesting!

Underneath, numpy does some parallelization that is controlled externally, depending on which libraries the cluster is using (BLAS/OpenBLAS, MKL, etc.). This StackOverflow post explains a possible way to address this, which may be easy or hard to do depending on the cluster setup!
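
The usual approach is to cap the thread pools of the numerical back-ends per process, and to do it before numpy is first imported, e.g.:

import os

# Limit the common BLAS/OpenMP back-ends to one thread per worker process.
# These variables are only read when the libraries are first loaded,
# so set them before importing numpy (or anything that imports numpy).
for var in ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS",
            "NUMEXPR_NUM_THREADS", "VECLIB_MAXIMUM_THREADS"):
    os.environ[var] = "1"

import numpy  # imported only after the limits are in place

With 32 per-fold processes and one BLAS thread each, the job should then stay within the 32 cores requested from the scheduler.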

@illdopejake
Author

Great, I had no idea numpy was doing that sort of thing; I guess it only becomes relevant for operations on really large arrays? I will give this a try next time and report back on whether it resolves the issue. Thanks again!

@sea-shunned
Member

Yep, and sometimes parallelization has more overhead cost than the time you save (if the arrays aren't big enough), so it can pay to adjust the settings.

I'll keep this issue open for now — if setting those environment variables or anything else does fix the issue please let us know!
