Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parallelization fixes #2280

Merged
merged 10 commits into from Jan 2, 2024
Merged

Parallelization fixes #2280

merged 10 commits into from Jan 2, 2024

Conversation

pmenczel
Copy link
Member

@pmenczel pmenczel commented Dec 22, 2023

Background

Since I am planning to add an mpi_parallel_map to the parallel module, I had a detailed look at the current implementations of parallel_map and loky_pmap. In the case of timeouts, errors or interrupts, I found surprising behavior. I will first summarize the current behavior in these situations, and then the behavior if my changes are included. It seems difficult to write better unit tests for such timing-sensitive behavior, but I have tested in detail on both Linux (WSL) and Windows.

Current behavior

  parallel_map loky_pmap
Timeout Finishes currently running tasks,
then returns results of all finished tasks.
If fail_fast, behaves similarly to parallel_map.
Otherwise, completes all tasks but only returns
results of those that were started before the timeout.
CTRL+C First CTRL+C interrupts all tasks, but program
enters deadlock requiring second CTRL+C
Interrupts currently running tasks, but then still
executes the rest before raising KeyboardInterrupt.
Task raises
exception
fail_fast: finishes running tasks,
then raises exception.
!fail_fast: raises MapExceptions at the end.
fail_fast: finishes all tasks, then raises exception.
!fail_fast: raises MapExceptions at the end.
Job timeout Ignored Always completes all tasks anyway, see below

New behavior

  parallel_map loky_pmap
Timeout Finishes currently running tasks,
then returns results of all finished tasks.
Aborts currently running tasks,
then returns results of all finished tasks.
CTRL+C First CTRL+C raises KeyboardInterrupt First CTRL+C raises KeyboardInterrupt
Task raises
exception
fail_fast: finishes running tasks,
then raises exception.
!fail_fast: raises MapExceptions at the end.
fail_fast: finishes tasks earlier in the list, then
aborts remaining ones and raises exception.
!fail_fast: raises MapExceptions at the end.
Job timeout Removed from documentation Removed from documentation

Job timeout

Currently, the job timeout parameter is ignored by parallel_map. In loky_pmap, it is not the maximum allowed time for one job, but the maximum time between two job finishes (possibly in different processes). If this time is exceeded, all tasks will still be executed until the end; only the results of the tasks that finished too slowly will be discarded. I do not think that this was the intention of the job timeout parameter?

Unfortunately, both ProcessPoolExecutor (which parallel_map is based on) and its loky version do not support timeouts for single tasks, nor do they support aborting single tasks manually. (The loky one supports killing all worker processes at once.) If we wanted to have a job timeout parameter, we would need to either use non-public API to obtain references to the worker processes and interrupt them manually, or to completely rewrite parallel_map and base it on e.g. multiprocessing.pool.Pool.

Maybe better to just remove the job_timeout parameter? If you agree with that, I will then also remove it from the available options for MultiTrajSolver and all its subclasses.

@coveralls
Copy link

coveralls commented Dec 22, 2023

Coverage Status

coverage: 84.184% (-0.01%) from 84.196%
when pulling a82532a on pmenczel:parallelization
into 7fbb567 on qutip:master.

@Ericgig
Copy link
Member

Ericgig commented Dec 22, 2023

@pmenczel Thank you for sorting that out.
I would say just remove the job_timeout everywhere.

qutip/solver/parallel.py Outdated Show resolved Hide resolved
Copy link
Contributor

@hodgestar hodgestar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a big improvement to me. I asked a couple of questions about whether it can be improved more.

@hodgestar
Copy link
Contributor

@pmenczel Thanks for sorting this all out. It is much better now. Happy for you to merge it when you're ready.

@pmenczel pmenczel merged commit 9263d94 into qutip:master Jan 2, 2024
11 of 12 checks passed
@pmenczel pmenczel mentioned this pull request Jan 18, 2024
2 tasks
@pmenczel pmenczel deleted the parallelization branch March 5, 2024 08:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants