
Reimplement the flux interface #1154

Merged
merged 4 commits, Jul 8, 2023

Conversation

jan-janssen
Member

@jan-janssen jan-janssen commented Jul 3, 2023

Example code:

>>> import flux.job
>>> from pyiron_atomistics import Project
>>> pr = Project('test')
>>> job = pr.create.job.Lammps("lmp")
>>> job.structure = pr.create.structure.ase.bulk("Al", cubic=True)
>>> job.server.executor = flux.job.FluxExecutor()
>>> job.server.cores = 2
>>> fs = job.run()
>>> print(fs.done())
>>> print(fs.result())
>>> print(fs.done())
False
0
True

This pull request is based on:
#1151
#1152
#1153

@jan-janssen
Member Author

The demo is also available at https://github.com/pyiron-dev/pyiron-flux-demo

@jan-janssen
Member Author

Missing parts:

  • Disable run_mode flux on GenericMasters.
  • Test copy() function.

@samwaseda samwaseda self-requested a review July 3, 2023 18:10
Member

@samwaseda samwaseda left a comment


DocStrings

@jan-janssen
Member Author

jan-janssen commented Jul 3, 2023

Python concurrent.futures.Executor example:

from concurrent.futures import ProcessPoolExecutor
from pyiron_atomistics import Project

pr = Project('test')

job = pr.create.job.Lammps("lmp")
job.structure = pr.create.structure.ase.bulk("Al", cubic=True)
job.server.executor = ProcessPoolExecutor()
job.server.cores = 2
fs = job.run()

print(fs.done())
print(fs.result())
print(fs.done())

With the ProcessPoolExecutor it is possible to develop modern workflows locally before submitting them to an HPC with flux.

@jan-janssen
Member Author

Missing parts:

  • Disable run_mode flux on GenericMasters.
  • Test copy() function.

Done

@jan-janssen
Member Author

DocStrings

I added docstrings; can you take another look?

Member

@samwaseda samwaseda left a comment


I really like the introductory text in the function. Is there a possibility to extend the example a bit more so that the code becomes complete? In particular, it's not very clear right now where Executor() comes from. I would really start from import flux.job and put job.server.executor = flux.job.FluxExecutor().

Apart from the docs, is there actually a practical example which shows the capability of flux? I'm asking this because I'm not 100% convinced that we should introduce a new way of accessing job data - so far in pyiron we have always worked with only the job object, and I'm wondering whether returning another object from job.run() is really necessary.

pyiron_base/jobs/job/runfunction.py Outdated Show resolved Hide resolved
pyiron_base/jobs/job/runfunction.py Outdated Show resolved Hide resolved
@samwaseda
Member

How can I actually unblock merging?

@jan-janssen
Member Author

How can I actually unblock merging?

I guess you can just create a new review and approve the merge request; that should unblock the merging.

@jan-janssen
Member Author

I really like the introductory text in the function. Is there a possibility to extend the example a bit more so that the code becomes complete? In particular, it's not very clear right now where Executor() comes from. I would really start from import flux.job and put job.server.executor = flux.job.FluxExecutor().

I added the import statement to clarify the usage.

Apart from the docs, is there actually a practical example which shows the capability of flux? I'm asking this because I'm not 100% convinced that we should introduce a new way of accessing job data - so far in pyiron we have always worked with only the job object, and I'm wondering whether returning another object from job.run() is really necessary.

To develop Exascale Workflows with pyiron - which is an essential part of my PostDoc here in Los Alamos - we need asynchronous workflows. Fitting a machine learning potential is one example we are currently working on.

Basically, we start thousands of DFT calculations, one per GPU, and once 100 DFT calculations are finished we start fitting the first potential. We then use the potential for active learning, generating new structures we want to include in the training set. During this initial fitting 90% of the DFT calculations are still running. Once more DFT calculations are completed we fit another potential and update the active learning to use the new potential. During the whole time the cyclic workflow of 1) DFT calculation, 2) fitting and 3) active learning is executed asynchronously and in parallel. So at any point in time there are some DFT calculations running, a new potential being fitted to the already finished DFT calculations, and a previous potential being used in an active learning procedure to identify new structures.
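The loop described above can be sketched with the standard library alone; run_dft, fit_potential and select_new_structures are hypothetical placeholders (not part of pyiron), and a ThreadPoolExecutor stands in for the FluxExecutor used in production:

```python
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

# placeholder functions, not part of pyiron
def run_dft(structure):
    return {"structure": structure, "energy": -1.0}

def fit_potential(training_set):
    return len(training_set)  # the "potential" is just a counter here

def select_new_structures(potential):
    return []  # a real workflow would return new candidate structures

with ThreadPoolExecutor(max_workers=4) as pool:
    pending = {pool.submit(run_dft, s) for s in range(10)}
    training_set, refit_at = [], 5
    while pending:
        # collect whatever finished, while the rest keeps running
        done, pending = wait(pending, return_when=FIRST_COMPLETED)
        training_set.extend(f.result() for f in done)
        if len(training_set) >= refit_at:
            # refit and run active learning concurrently with the
            # still-pending calculations
            potential = fit_potential(training_set)
            select_new_structures(potential)
            refit_at += 5
print(len(training_set))  # 10
```

The key point is that fitting and active learning overlap with the remaining calculations instead of waiting for all of them.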

The standard interface in Python for developing such asynchronous procedures is the concurrent.futures module, so this pull request aims to make pyiron compatible with that standard interface. In the future I want to connect this to the workflow developments of you and @liamhuber, but that is a next step. For the moment I want to leave it to the user to manage the concurrent futures objects, as I am not yet sure what the best strategy is. But to enable users to test this interface we need the integration inside pyiron_base, as the goal is to use the current interface in production runs in Los Alamos within the next month.

@jan-janssen jan-janssen added the format_black reformat the code using the black standard label Jul 4, 2023
@samwaseda
Member

Example code:

>>> import flux.job
>>> from pyiron_atomistics import Project
>>> pr = Project('test')
>>> job = pr.create.job.Lammps("lmp")
>>> job.structure = pr.create.structure.ase.bulk("Al", cubic=True)
>>> job.server.executor = flux.job.FluxExecutor()
>>> job.server.cores = 2
>>> fs = job.run()
>>> print(fs.done())
>>> print(fs.result())
>>> print(fs.done())
False
0
True

Actually I cannot find flux. Is it on conda?

@jan-janssen
Member Author

Actually I cannot find flux. Is it on conda?

Yes, it is flux-core on conda-forge.

@samwaseda
Member

I got this error message: FileNotFoundError: [Errno 2] Unable to connect to Flux: broker socket /run/flux/local was not found

@jan-janssen
Member Author

I got this error message: FileNotFoundError: [Errno 2] Unable to connect to Flux: broker socket /run/flux/local was not found

As illustrated in https://github.com/pyiron-dev/pyiron-flux-demo, the first step is to call flux start; this returns a shell, and then in this shell the Python process has to be executed. For a Jupyter notebook you can use flux start jupyter notebook. In addition, just like for any other queuing system, it is necessary to modify the run_*.sh scripts for the individual codes to use flux run, just like they would use srun for SLURM.

@samwaseda
Member

To develop Exascale Workflows with pyiron - which is an essential part of my PostDoc here in Los Alamos - we need asynchronous workflows. Fitting a machine learning potential is one example we are currently working on.

I know what it should do, but the question is how it does it. So far I haven't been able to test it myself, but while in pyiron there were different types of run modes, there has never been anything more than just run. Now there's a returned object just for flux which follows a very different syntax after that. I just don't really understand: 1) how the user is supposed to know how it works; 2) why it should deviate so strongly from the unified pyiron syntax.

This being said, since I haven't been able to run it myself so far there might have been a fundamental misunderstanding from my side, so feel free to correct me if you think that's the case.

@jan-janssen
Member Author

Now there's a returned object just for flux which follows a very different syntax after that. I just don't really understand: 1) how the user is supposed to know how it works; 2) why it should deviate so strongly from the unified pyiron syntax.

Yes, asynchronous workflows are different from the synchronous workflows we developed before. Previously each step in a workflow had a clear start and a clear end, and only after one step ended did a new step start. In asynchronous workflows the individual steps overlap. To address your concerns:

  1. The implementation tries to follow the Python standard library as closely as possible: https://docs.python.org/3/library/concurrent.futures.html This is not to say it is easy to develop asynchronous workflows, but if you already develop asynchronous workflows with Python, then pyiron feels natural - at least this is the intention. The second example I posted above as well as the unit tests use the concurrent.futures.ProcessPoolExecutor, which is part of the standard library and should be easy to test on any system.
  2. Yes, the syntax is different, but only in a minimal way, as the run function returns a pointer which links to the execution of the current job. This pointer is only used to check if the calculation is finished. Maybe in the future this is again going to be integrated into the job.status. But at the moment, to be able to use the standard Python libraries for asynchronous programming, we need direct access to the future object.

An alternative suggestion would be to attach the future object to job.server.future, then the run function does not return any output, if that is more desirable.
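The done()/result() calls on the returned pointer are not pyiron-specific; they are the standard concurrent.futures.Future protocol, which can be tried without flux or pyiron:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def slow_add(a, b):
    # stand-in for a long-running calculation
    time.sleep(0.2)
    return a + b

with ThreadPoolExecutor(max_workers=1) as executor:
    fs = executor.submit(slow_add, 1, 2)  # returns a Future immediately
    print(fs.done())      # almost certainly False: the call is still running
    result = fs.result()  # blocks until the call finishes
    print(result)         # 3
    print(fs.done())      # True
```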

@samwaseda
Member

As illustrated in https://github.com/pyiron-dev/pyiron-flux-demo, the first step is to call flux start; this returns a shell, and then in this shell the Python process has to be executed. For a Jupyter notebook you can use flux start jupyter notebook. In addition, just like for any other queuing system, it is necessary to modify the run_*.sh scripts for the individual codes to use flux run, just like they would use srun for SLURM.

I have serious concerns about the utility of flux based on this example, because I don't see straightforwardly how to relate other pyiron jobs to this example. I'm sure that this is the right way, but I just don't think it's ripe for pyiron_base. Would it be imaginable to move it to pyiron_contrib? If not, I think this demo has to be extended to include steps to make other programs available using flux in a pyironic way (i.e. it shouldn't be a string inside a notebook that's exported and executed).

Maybe @pmrv or @liamhuber has an opinion?

@niklassiemer
Member

Indeed, the returned Future upon run is 'only' for status checking, isn't it? One concern is also that a 'standard' pyiron user forgets to assign the return value of run(), since we never did this... in that case it starts running but one could not get the Future object any more. Thus, I would like to store the future somewhere...
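The difference between returning the future and storing it can be shown with a minimal mock; Server and Job here are simplified stand-ins, not the real pyiron classes, and job.server.future is the proposed (not yet merged) attribute:

```python
from concurrent.futures import ThreadPoolExecutor

class Server:
    # stand-in for pyiron's job.server; `future` is the proposed attribute
    def __init__(self):
        self.executor = None
        self.future = None

class Job:
    def __init__(self):
        self.server = Server()

    def run(self):
        # storing the future on the server keeps it reachable even when
        # the return value of run() is discarded by the user
        self.server.future = self.server.executor.submit(lambda: 42)
        return self.server.future

job = Job()
job.server.executor = ThreadPoolExecutor(max_workers=1)
job.run()  # return value ignored, as a 'standard' pyiron user would
result = job.server.future.result()  # still accessible afterwards
job.server.executor.shutdown()
print(result)  # 42
```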

@jan-janssen
Member Author

it shouldn't be a string inside a notebook that's exported and executed.

If you start your notebook using flux start jupyter notebook then you can use all the code in the string directly in the Jupyter notebook. The issue here is just on the mybinder side, because in mybinder I cannot manipulate the way the Jupyter notebook is started.

@jan-janssen
Member Author

Indeed, the returned Future upon run is 'only' for status checking, isn't it? One concern is also that a 'standard' pyiron user forgets to assign the return value of run(), since we never did this... in that case it starts running but one could not get the Future object any more. Thus, I would like to store the future somewhere...

As suggested above:

An alternative suggestion would be to attach the future object to job.server.future, then the run function does not return any output, if that is more desirable.

@jan-janssen
Member Author

Would it be imaginable to move it to pyiron_contrib?

I thought about this for a long time, but on the one hand this would result in a lot of duplicate code to have LAMMPS and VASP run with the new interface, as we would have to have modified job classes and so on. We had similar discussions about a pyiron interface without files in https://github.com/pyiron/pyiron_contrib/tree/main/pyiron_contrib/nofiles which ultimately resulted in the release of https://github.com/pyiron/pyiron_lammps . While in this case having a standalone code really helped us to accelerate the development, in particular for large numbers of small LAMMPS simulations where pyiron struggles. The flux integration is primarily designed for the up-scaling of DFT calculations and orchestrating large numbers of DFT calculations in one allocation. This directly addresses the primary application of pyiron_atomistics, so from my perspective the flux code has to be integrated in pyiron_base.

@jan-janssen
Member Author

To simplify the discussion, I created a separate pull request for the ProcessPoolExecutor interface:
#1155

This pull request is based on ProcessPoolExecutor interface and adds the flux interface.

@samwaseda
Member

If you start your notebook using flux start jupyter notebook then you can use all the code in the string directly in the Jupyter notebook. The issue here is just on the mybinder side, because in mybinder I cannot manipulate the way the Jupyter notebook is started.

This sounds like "this is a demo that people shouldn't follow", which is a contradiction in itself... For me it doesn't have to be quite as interactive as binder, but we need something more enlightening that people can straightforwardly follow.

@samwaseda
Member

I thought about this for a long time, but on the one hand this would result in a lot of duplicate code to have LAMMPS and VASP run with the new interface, as we would have to have modified job classes and so on. We had similar discussions about a pyiron interface without files in https://github.com/pyiron/pyiron_contrib/tree/main/pyiron_contrib/nofiles which ultimately resulted in the release of https://github.com/pyiron/pyiron_lammps . While in this case having a standalone code really helped us to accelerate the development, in particular for large numbers of small LAMMPS simulations where pyiron struggles.

If you are saying it should be included in pyiron_base because it's otherwise not efficient, I have to say it's not really a reason... Btw. I guess the last phrase requires an auxiliary phrase that might have been crucial.

The flux integration is primarily designed for the up-scaling of DFT calculations and orchestrating large numbers of DFT calculations in one allocation. This directly addresses the primary application of pyiron_atomistics, so from my perspective the flux code has to be integrated in pyiron_base.

Again, I understand what it was made for, and I agree that it's the future (in the literal sense), but the question is how it should be implemented.

As suggested above:

An alternative suggestion would be to attach the future object to job.server.future, then the run function does not return any output, if that is more desirable.

This statement shows for me that the discussion has not converged yet.

Just to make sure, this PR has two major problems for me:

  • It breaks the generality of run
  • It is not clear how to make other jobs flux-compatible within the pyiron framework

For both points it's difficult for me to understand why it's happening this way and to estimate what the steps would be to resolve the problems, but I have the feeling that we need a discussion on a more fundamental level. At least I'm sure we cannot maintain the code reliably in the current format.

@jan-janssen
Member Author

jan-janssen commented Jul 4, 2023

  • It breaks the generality of run

As suggested already above, I am fine with attaching the futures object to job.server.future if that resolves your concerns.

  • It is not clear how to make other jobs flux-compatible within the pyiron framework

I guess this is a general misunderstanding. Each job is flux-compatible, as flux can be used just like all the other queuing systems via pysqa. #1155 implements a new asynchronous job interface, based on the concurrent.futures.ProcessPoolExecutor class. This pull request extends #1155 by adding an asynchronous flux interface. So flux is basically the first queuing system which supports the asynchronous job interface as well as the traditional interface based on pysqa.

@jan-janssen
Member Author

I updated the flux mybinder example https://github.com/pyiron-dev/pyiron-flux-demo/tree/main - it now starts jupyter with flux start jupyter notebook. This is done by manipulating the kernel.json file.
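The kernel.json manipulation can be sketched as below; the spec content is a typical default IPython kernel spec assumed for illustration, and the actual file location depends on the installed kernel (neither is taken from the demo repository):

```python
import json

# a typical default IPython kernel spec (assumed for illustration)
kernel_spec = {
    "argv": ["python", "-m", "ipykernel_launcher", "-f", "{connection_file}"],
    "display_name": "Python 3",
    "language": "python",
}

# prepend "flux start" so Jupyter boots the kernel inside a Flux instance,
# mirroring `flux start jupyter notebook` on the command line
kernel_spec["argv"] = ["flux", "start"] + kernel_spec["argv"]
print(json.dumps(kernel_spec, indent=2))
```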

@liamhuber
Member

liamhuber commented Jul 4, 2023

I'm not going to be able to get to reviewing this PR today. Overall I thought #1155 was pretty good, although I may want to play around with some different combinations (like interactively opening a job with a flux executor set) to see how it goes. The baseline behaviour is OK by me, though.

  • It breaks the generality of run

As suggested already above, I am fine with attaching the futures object to job.server.future if that resolves your concerns.

This also comes up in #1155. I am ok with this solution; it may not be the final resting place of the futures object, but it will do for now and doesn't clutter up the main job namespace.

  • It is not clear how to make other jobs flux-compatible within the pyiron framework

I guess this is a general misunderstanding. Each job is flux-compatible, as flux can be used just like all the other queuing systems via pysqa. #1155 implements a new asynchronous job interface, based on the concurrent.futures.ProcessPoolExecutor class. This pull request extends #1155 by adding an asynchronous flux interface. So flux is basically the first queuing system which supports the asynchronous job interface as well as the traditional interface based on pysqa.

I think @samwaseda means master jobs? In which case #1155 certainly does not provide a universal interface and if this PR is similar his concern is totally valid!

I do think that at this stage it's fair enough to fail cleanly rather than support everything. Although obviously I would be much more comfortable with a "fail clean instead of working" approach in contrib than here. (I saw your earlier comment on the topic of moving to contrib, and I'm not currently of the opinion that we should force the flux stuff over there, but that still doesn't mean I'm happy about this rough-around-the-edges stuff being in base...)

@jan-janssen, take a look at my comments in #1155. I'll come review this one tomorrow, but if it's similar to that then I guess I will be able to get behind the gist of it. I do share Sam's other concern:

This sounds like "this is a demo that people shouldn't follow", which is a contradiction in itself... For me it doesn't have to be quite as interactive as binder, but we need something more enlightening that people can straightforwardly follow.

The binder on https://github.com/pyiron-dev/pyiron-flux-demo full on fails, so what is even the point? But anyhow, while a somehow-functional demo is an important thing to have before merging this, this seems like a simple technical problem that I trust we'll find a way around (EDIT: Super, the binder issue got resolved while I was writing this. The new version of the binder link is working fine for me!). What's more core is coming to consensus about the architecture of where to put the futures and how to handle master jobs.

@samwaseda
Member

I guess this is a general misunderstanding. Each job is flux-compatible, as flux can be used just like all the other queuing systems via pysqa. #1155 implements a new asynchronous job interface, based on the concurrent.futures.ProcessPoolExecutor class. This pull request extends #1155 by adding an asynchronous flux interface. So flux is basically the first queuing system which supports the asynchronous job interface as well as the traditional interface based on pysqa.

OK, then I'm now a lot more comfortable, but with the demo before you updated it, it was not very clear how to make that happen. I'm finished for today, but I promise that I'll take a look as soon as I come back tomorrow morning.

@liamhuber liamhuber marked this pull request as draft July 5, 2023 16:32
@liamhuber
Member

This is built on top of #1155 and keeps merging those changes into it, but targets main; let's make it a draft until #1155 is merged.

@samwaseda
Member

One thing I don't understand is the coexistence of fs.done() and job status. Is there a reason that they cannot be merged except for historical reasons?

@jan-janssen
Member Author

One thing I don't understand is the coexistence of fs.done() and job status. Is there a reason that they cannot be merged except for historical reasons?

I agree; if I were to start a new version of pyiron_base I would only use concurrent.futures and remove the job status completely, but this change would be much more drastic, so for the moment we are going to have both concepts until we have more experience with concurrent.futures.
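As a sketch of how the two concepts could eventually converge, a hypothetical helper (not part of pyiron_base, and the status names are illustrative) could derive a pyiron-style status string directly from the Future:

```python
from concurrent.futures import Future

def status_from_future(fs):
    # hypothetical mapping, not implemented in pyiron_base
    if fs is None:
        return "initialized"   # no executor-based run started yet
    if fs.running():
        return "running"
    if fs.done():
        return "finished"
    return "submitted"         # queued but not yet started

fs = Future()                  # a bare Future: queued, not yet running
print(status_from_future(fs))  # submitted
fs.set_result(0)
print(status_from_future(fs))  # finished
```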

@samwaseda
Member

I agree; if I were to start a new version of pyiron_base I would only use concurrent.futures and remove the job status completely, but this change would be much more drastic, so for the moment we are going to have both concepts until we have more experience with concurrent.futures.

but isn’t this then the reason why it should be in contrib and not base? We are breaking the concept with this change

@jan-janssen
Member Author

but isn’t this then the reason why it should be in contrib and not base? We are breaking the concept with this change

I agree that this implementation adds an alternative to the previous implementation, and I am convinced that if I were to rewrite pyiron_base from scratch I would only use concurrent.futures. Still, I also think we both agree that this is not what we want to do right now, as we have so many other developments in the pipeline regarding the workflows/graphs and so on.

So the current pull requests introduce the functionality in a way that allows existing users who leverage pyiron for DFT calculations to already benefit from the new functionality, without interfering with existing workflows. As discussed in #1155 I want to add the integration with wait_for_jobs and remove_jobs, so the only new command a user has to learn is setting job.server.executor, while all the rest behaves exactly the same. A user can access the internal job.server.future interface but they do not have to.

@niklassiemer
Member

but isn’t this then the reason why it should be in contrib and not base? We are breaking the concept with this change

I agree that this implementation adds an alternative to the previous implementation, and I am convinced that if I were to rewrite pyiron_base from scratch I would only use concurrent.futures. Still, I also think we both agree that this is not what we want to do right now, as we have so many other developments in the pipeline regarding the workflows/graphs and so on.

So the current pull requests introduce the functionality in a way that allows existing users who leverage pyiron for DFT calculations to already benefit from the new functionality, without interfering with existing workflows. As discussed in #1155 I want to add the integration with wait_for_jobs and remove_jobs, so the only new command a user has to learn is setting job.server.executor, while all the rest behaves exactly the same. A user can access the internal job.server.future interface but they do not have to.

I like the general idea and I am fine with having this in pyiron_base if we agree on the interface, if it does not affect current users, if it is integrated into the pyiron syntax as well as possible, and if it raises clear error messages if something is not supported. I deem we are on a good way to achieve these goals :) Maybe the job status should be able to react to a present Future object and ask it instead of the file or the database if there is a Future? This would streamline the user experience.

@jan-janssen (from a future perspective) why does the user need to set the executor? Could this not be handled by pyiron upon using a flux-based queue type? Right now, it is of course fine that users have to do this to enable the new run functionality!

@samwaseda
Member

Now I thought about it again and think that this should also change status. Essentially, at least for the time being, it should work in the conventional way, meaning status=suspended should be there, and refresh_blablabla should update the statuses.

@liamhuber
Member

liamhuber commented Jul 7, 2023

@jan-janssen it's a bit hard to say until the diff gets updated, but it looks like once #1155 gets merged the changes here will be extremely minimal and good-to-go.

It would be cool to have the infrastructure in our CI for OS-dependent envs (I can imagine an implementation of this and it should be possible); then we could write linux-only tests for the flux stuff. But that's for the future, and IMO now that #1155 has a bunch of tests we can get away without any extensive testing for flux here (for the foreseeable future at least).

@jan-janssen jan-janssen marked this pull request as ready for review July 7, 2023 16:36
@jan-janssen
Member Author

@jan-janssen it's a bit hard to say until the diff gets updated, but it looks like once #1155 gets merged the changes here will be extremely minimal and good-to-go.

I just merged the changes into main and main into this pull request. It is basically just a single function that is added, plus jinja2 as an additional dependency, but we already use jinja2 in pysqa so it was already an implicit dependency before.

Member

@liamhuber liamhuber left a comment


lgtm!

I guess some of the stuff about how to boot your Jupyter notebook such that it fluxes might be nice to document, but also that's basically just flux documentation and not pyiron-flux documentation, so I'm also OK with leaving it out. Now that this is on top of #1155 it's nice clean changes, so in addition to the working binder demo I'm totally happy with it.

@jan-janssen
Member Author

@samwaseda Do you want to take another look? I would be very happy to merge this before Monday.

Member

@niklassiemer niklassiemer left a comment


I think with the change to store the Future in job.server.future the example code needs an edit. The code changes LGTM.

Member

@samwaseda samwaseda left a comment


Yeah looks good to me now.

@samwaseda samwaseda merged commit 74b82d1 into main Jul 8, 2023
23 checks passed
@delete-merged-branch delete-merged-branch bot deleted the flux branch July 8, 2023 07:54