
Worker Job Class #497

Merged
merged 13 commits into master from worker Nov 10, 2021
Conversation

jan-janssen
Member

Follow up to #474

Implementing the worker as a separate job class using multiprocessing.

@jan-janssen
Member Author

jan-janssen commented Nov 5, 2021

Example:

```python
from pyiron import Project
pr_worker = Project("worker")
pr_calc = Project("calc")

# Setup Worker
job_worker = pr_worker.create.job.WorkerJob("runner")
job_worker.server.run_mode.non_modal = True
job_worker.server.cores = 4
job_worker.run()

# Submit Calculation
structure = pr_calc.create.structure.ase.bulk("Al", cubic=True)
structure.set_repeat([10, 10, 10])
for i in range(10):
    job = pr_calc.create.job.Lammps("lmp_" + str(i))
    job.structure = structure
    job.server.run_mode.worker = True
    job.master_id = job_worker.job_id
    job.run()

# Monitoring
pr_calc.job_table()
```

```python
class WorkerJob(PythonTemplateJob):
    def __init__(self, project, job_name):
        super(WorkerJob, self).__init__(project, job_name)
        self.input['project'] = None
```
Member
PythonTemplateJob facilitates dot notation for input now
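
That is, the assignment in the snippet above could presumably be shortened to:

```python
self.input.project = None  # dot notation instead of self.input['project'] = None
```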

@jan-janssen
Member Author

A mybinder example for this class is available at https://github.com/jan-janssen/pyiron-worker

@jan-janssen
Member Author

Waiting for conda-forge/staged-recipes#16822

jan-janssen marked this pull request as draft November 6, 2021 20:27
```python
    for pp, p, job_id in path_lst
]
active_job_ids += [j[2] for j in job_lst]
_ = [pool.put(j) for j in job_lst]
```
Contributor
What's the reason not to use a plain multiprocessing.Pool & map_async here? I'm a bit wary because I see that aproc uses inspect.getsource to transfer the function to the worker instead of pickle or dill.

Member Author

The list of jobs grows while the worker is already executing it, so a simple map does not work. Behind the scenes, the current implementation combines a multiprocessing pool with a queue. In addition, I also tried dill and cloudpickle, but both failed; this might be related to the use of decorators.
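
For illustration only, a minimal sketch of the pool-plus-queue pattern described here; the helper `run_job` and the worker count are hypothetical stand-ins, not the actual aproc internals:

```python
import multiprocessing as mp

def run_job(job_id):
    # hypothetical stand-in for executing a single pyiron job
    print(f"running job {job_id}")

def worker_loop(queue):
    # each worker process pulls job ids from the shared queue;
    # the queue can keep growing while the workers are busy
    while True:
        job_id = queue.get()
        if job_id is None:  # sentinel: shut this worker down
            break
        run_job(job_id)

if __name__ == "__main__":
    queue = mp.Queue()
    workers = [mp.Process(target=worker_loop, args=(queue,)) for _ in range(4)]
    for w in workers:
        w.start()
    # the producer keeps submitting new job ids while the workers run
    for job_id in range(10):
        queue.put(job_id)
    for _ in workers:
        queue.put(None)  # one sentinel per worker
    for w in workers:
        w.join()
```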

Contributor

I don't understand why the growing list is a problem.

Inside the while loop the worker checks the job table for submitted jobs that are assigned to it via the master_id and puts them in a list. It could then map_async the list and mark the job ids as 'running' or 'queued' or whatever. In the next iteration of the while loop there may be new submitted jobs, which can be handled the same way, or there may be submitted jobs that were already found in the last iteration; but since we marked those in the last iteration, we can exclude them from the map in this iteration.
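
A rough sketch of that loop, with hypothetical helpers `run_job` and `fetch_submitted_job_ids` standing in for the actual job execution and job-table query:

```python
import multiprocessing as mp
import time

def run_job(job_id):
    # hypothetical stand-in for executing one pyiron job
    print(f"running job {job_id}")

def fetch_submitted_job_ids(iteration):
    # hypothetical stand-in for querying the job table by master_id;
    # returns a growing list to mimic jobs being submitted over time
    return list(range(2 * iteration))

if __name__ == "__main__":
    seen = set()
    results = []
    with mp.Pool(4) as pool:
        for iteration in range(5):  # would be `while True` in a real worker
            # dispatch only the jobs not already handled in earlier iterations
            new_ids = [j for j in fetch_submitted_job_ids(iteration) if j not in seen]
            seen.update(new_ids)
            if new_ids:
                results.append(pool.map_async(run_job, new_ids))
            time.sleep(1)  # poll interval before re-checking the job table
        for r in results:
            r.wait()
```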

Member Author

The limitation of multiple map_async calls is that they cannot be executed concurrently, so the second call to map_async is only executed once the first one has finished. The async only means you regain control of the main process; it does not mean that the workers are asynchronous. That is why I created the aproc interface.

Contributor
pmrv commented Nov 8, 2021

From my testing this is not the case. I can call map_async multiple times just fine as long as there are available workers in the pool. Once the pool is fully occupied you naturally have to wait, but I assume aproc is the same in this regard.

Here's my test script:

```python
import multiprocessing as mp

def pow2(x):
    import time
    print(x)
    time.sleep(10)
    return x**2

with mp.Pool(4) as p:
    r1 = p.map_async(pow2, range(2))
    r2 = p.map_async(pow2, range(4, 8))

    r1.wait()
    print(r1.get())
    r2.wait()
    print(r2.get())
```

with output:

```
0
1
4
5
<nothing happens here for 10s because we're blocked on r1.wait()>
6
7
[0, 1]
[16, 25, 36, 49]
```

That shows that the first two jobs from the second map call run straight away, and the last two run once the jobs from the first map are finished. This sounds like exactly what we need.

Member Author

I thought I had tried that before, but it is working now, so I reverted to map_async. Thanks again for pointing me in this direction.

jan-janssen marked this pull request as ready for review November 9, 2021 05:12
@jan-janssen
Member Author

@pmrv and @liamhuber Any other feedback? Otherwise I would like to merge this soon to be included in the next pyiron_base release.

@pmrv
Contributor

pmrv commented Nov 9, 2021

I would like to have a small example in the class or module docstring on how to use the worker. The snippet above would already do it. Otherwise it looks good to me.

@liamhuber
Member

> @pmrv and @liamhuber Any other feedback? Otherwise I would like to merge this soon to be included in the next pyiron_base release.

I would like to see the class appear in the unit tests. If I understand correctly, as long as the test inherits from pyiron_base._tests.PyironTestCase and the docstring_module attribute is set, then the example in the docstring will get run; this is probably already sufficient (see the sketch below). @pmrv let me know if I misspoke or if you think any additional test is necessary.
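
For illustration, a minimal sketch of such a test, assuming the mechanism works as described; the import path for the worker module is hypothetical:

```python
from pyiron_base._tests import PyironTestCase

# hypothetical import path; wherever WorkerJob actually lives
from pyiron_base.jobs import worker as worker_module


class TestWorkerJob(PyironTestCase):
    # with docstring_module set, PyironTestCase runs the examples
    # in this module's docstrings as part of the unit tests
    docstring_module = worker_module
```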

@jan-janssen
Member Author

> I would like to see the class appear in the unit tests. If I understand correctly, as long as the test inherits from pyiron_base._tests.PyironTestCase and the docstring_module attribute is set, then the example in the docstring will get run; this is probably already sufficient. @pmrv let me know if I misspoke or if you think any additional test is necessary.

As we no longer have the example job, I am not exactly sure which job to test it with. The ScriptJob, which is used for many other tests, is not really sufficient.

@liamhuber
Member

Why would ScriptJob be insufficient?

@niklassiemer
Member

> As we no longer have the example job, I am not exactly sure which job to test it with. The ScriptJob, which is used for many other tests, is not really sufficient.

However, we have

```python
class ToyJob(PythonTemplateJob):
    def __init__(self, project, job_name):
        """A toyjob to test export/import functionalities."""
        super(ToyJob, self).__init__(project, job_name)
        self.input.data_in = 100

    # This function is executed
    def run_static(self):
        self.status.running = True
        self.output.data_out = self.input.data_in + 1
        self.status.finished = True
        self.to_hdf()
```

Maybe this is sufficient?

@jan-janssen
Member Author

> Maybe this is sufficient?

No, because the class defined in the test cannot be reloaded in the subprocess on the worker. But I got it working with the ScriptJob instead.

jan-janssen merged commit 2e193e2 into master Nov 10, 2021
delete-merged-branch bot deleted the worker branch November 10, 2021 14:28