
Reimplement the flux interface #1154

Merged
merged 4 commits, Jul 8, 2023

Conversation

jan-janssen
Member

@jan-janssen jan-janssen commented Jul 3, 2023

Example code:

>>> import flux.job
>>> from pyiron_atomistics import Project
>>> pr = Project('test')
>>> job = pr.create.job.Lammps("lmp")
>>> job.structure = pr.create.structure.ase.bulk("Al", cubic=True)
>>> job.server.executor = flux.job.FluxExecutor()
>>> job.server.cores = 2
>>> fs = job.run()
>>> print(fs.done())
>>> print(fs.result())
>>> print(fs.done())
False
0
True

This pull request is based on:
#1151
#1152
#1153

@jan-janssen
Member Author

The demo is also available at https://github.com/pyiron-dev/pyiron-flux-demo

@jan-janssen
Member Author

Missing parts:

  • Disable run_mode flux on GenericMasters.
  • Test copy() function.

@samwaseda samwaseda self-requested a review July 3, 2023 18:10
Member

@samwaseda samwaseda left a comment


DocStrings

@jan-janssen
Member Author

jan-janssen commented Jul 3, 2023

Python concurrent.futures.Executor example:

from concurrent.futures import ProcessPoolExecutor
from pyiron_atomistics import Project

pr = Project('test')

job = pr.create.job.Lammps("lmp")
job.structure = pr.create.structure.ase.bulk("Al", cubic=True)
job.server.executor = ProcessPoolExecutor()
job.server.cores = 2
fs = job.run()

print(fs.done())
print(fs.result())
print(fs.done())

With the ProcessPoolExecutor it is possible to develop modern workflows locally before submitting them to an HPC with flux.

@jan-janssen
Member Author

Missing parts:

  • Disable run_mode flux on GenericMasters.
  • Test copy() function.

Done

@jan-janssen
Member Author

DocStrings

I added docstrings; can you take another look?

Member

@samwaseda samwaseda left a comment


I really like the introductory text in the function. Is there a possibility to extend the example a bit more so that the code becomes complete? In particular, it's not very clear right now where Executor() comes from. I would really start from import flux.job and put job.server.executor = flux.job.FluxExecutor().

Apart from the docs, is there actually a practical example which shows the capability of flux? I'm asking this because I'm not 100% convinced that we should introduce a new way of accessing job data - so far in pyiron we have always worked with only the job object, and I'm wondering whether returning another object from job.run() is really necessary.

pyiron_base/jobs/job/runfunction.py Outdated Show resolved Hide resolved
pyiron_base/jobs/job/runfunction.py Outdated Show resolved Hide resolved
@samwaseda
Member

How can I actually unblock merging?

@jan-janssen
Member Author

How can I actually unblock merging?

I guess you can just create a new review and approve the merge request; that should unblock the merging.

@jan-janssen
Member Author

I really like the introductory text in the function. Is there a possibility to extend the example a bit more so that the code becomes complete? In particular, it's not very clear right now where Executor() comes from. I would really start from import flux.job and put job.server.executor = flux.job.FluxExecutor().

I added the import statement to clarify the usage.

Apart from the docs, is there actually a practical example which shows the capability of flux? I'm asking this because I'm not 100% convinced that we should introduce a new way of accessing job data - so far in pyiron we have always worked with only the job object, and I'm wondering whether returning another object from job.run() is really necessary.

To develop Exascale Workflows with pyiron - which is an essential part of my PostDoc here in Los Alamos - we need asynchronous workflows. Fitting a machine learning potential is one example we are currently working on.

Basically, we start thousands of DFT calculations, one per GPU, and once 100 DFT calculations are finished we start fitting the first potential. We then use the potential for active learning, generating new structures we want to include in the training set. During this initial fitting 90% of the DFT calculations are still running. Once more DFT calculations are completed we fit another potential and update the active learning to use the new potential. During the whole time the cyclic workflow of 1) DFT calculation, 2) fitting and 3) active learning is executed asynchronously and in parallel. So at any point in time there are some DFT calculations running, a new potential being fitted to the already finished DFT calculations, and a previous potential being used in an active learning procedure to identify new structures.
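The loop described above can be sketched with the standard library alone; run_dft, fit_potential and select_new_structures are hypothetical placeholders (not part of pyiron), and a ThreadPoolExecutor stands in for the FluxExecutor used in production:

```python
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

# placeholder functions, not part of pyiron
def run_dft(structure):
    return {"structure": structure, "energy": -1.0}

def fit_potential(training_set):
    return len(training_set)  # the "potential" is just a counter here

def select_new_structures(potential):
    return []  # a real workflow would return new candidate structures

with ThreadPoolExecutor(max_workers=4) as pool:
    pending = {pool.submit(run_dft, s) for s in range(10)}
    training_set, refit_at = [], 5
    while pending:
        # collect whatever finished, while the rest keeps running
        done, pending = wait(pending, return_when=FIRST_COMPLETED)
        training_set.extend(f.result() for f in done)
        if len(training_set) >= refit_at:
            # refit and run active learning concurrently with the
            # still-pending calculations
            potential = fit_potential(training_set)
            select_new_structures(potential)
            refit_at += 5
print(len(training_set))  # 10
```

The key point is that fitting and active learning overlap with the remaining calculations instead of waiting for all of them.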

The standard interface in Python for developing such asynchronous procedures is the concurrent.futures module, so this pull request aims to make pyiron compatible with that standard interface. In the future I want to connect this to the workflow developments of you and @liamhuber, but that is a next step. For the moment I want to leave it to the user to manage the concurrent futures objects, as I am not yet sure what the best strategy is. But to enable users to test this interface we need the integration inside pyiron_base, as the goal is to use the current interface in production runs in Los Alamos within the next month.

@jan-janssen jan-janssen added the format_black reformat the code using the black standard label Jul 4, 2023
@samwaseda
Member

Example code:

>>> import flux.job
>>> from pyiron_atomistics import Project
>>> pr = Project('test')
>>> job = pr.create.job.Lammps("lmp")
>>> job.structure = pr.create.structure.ase.bulk("Al", cubic=True)
>>> job.server.executor = flux.job.FluxExecutor()
>>> job.server.cores = 2
>>> fs = job.run()
>>> print(fs.done())
>>> print(fs.result())
>>> print(fs.done())
False
0
True

Actually I cannot find flux. Is it on conda?

@jan-janssen
Member Author

Actually I cannot find flux. Is it on conda?

Yes, it is flux-core on conda-forge.

@samwaseda
Member

I got this error message: FileNotFoundError: [Errno 2] Unable to connect to Flux: broker socket /run/flux/local was not found

@jan-janssen
Member Author

I got this error message: FileNotFoundError: [Errno 2] Unable to connect to Flux: broker socket /run/flux/local was not found

As illustrated in https://github.com/pyiron-dev/pyiron-flux-demo, the first step is to call flux start; this returns a shell, and then in this shell the Python process has to be executed. For a Jupyter notebook you can use flux start jupyter notebook. In addition, just like for any other queuing system, it is necessary to modify the run_*.sh scripts for the individual codes to use flux run, just like they would use srun for SLURM.

@samwaseda
Member

To develop Exascale Workflows with pyiron - which is an essential part of my PostDoc here in Los Alamos - we need asynchronous workflows. Fitting a machine learning potential is one example we are currently working on.

I know what it should do, but the question is how it does it. So far I haven't been able to test it myself, but while in pyiron there were different types of run modes, there has never been anything more than just run. Now there's a returned object just for flux which follows a very different syntax after that. I just don't really understand: 1) how the user is supposed to know how it works; 2) why it should deviate so strongly from the unified pyiron syntax.

This being said, since I haven't been able to run it myself so far there might have been a fundamental misunderstanding from my side, so feel free to correct me if you think that's the case.

@jan-janssen
Member Author

Now there's a returned object just for flux which follows a very different syntax after that. I just don't really understand: 1) how the user is supposed to know how it works; 2) why it should deviate so strongly from the unified pyiron syntax.

Yes, asynchronous workflows are different from the synchronous workflows we developed before. Previously each step in a workflow had a clear start and a clear end, and only after one step ended did a new step start. In asynchronous workflows the individual steps overlap. To address your concerns:

  1. The implementation tries to follow the Python standard library as closely as possible: https://docs.python.org/3/library/concurrent.futures.html This is not to say it is easy to develop asynchronous workflows, but if you already develop asynchronous workflows with Python, then pyiron feels natural - at least this is the intention. The second example I posted above as well as the unit tests use the concurrent.futures.ProcessPoolExecutor, which is part of the standard library and should be easy to test on any system.
  2. Yes, the syntax is different, but only in a minimal way, as the run function returns a pointer which links to the execution of the current job. This pointer is only used to check if the calculation is finished. Maybe in the future this is again going to be integrated into the job.status. But at the moment, to be able to use the standard Python libraries for asynchronous programming, we need direct access to the future object.

An alternative suggestion would be to attach the future object to job.server.future, then the run function does not return any output, if that is more desirable.
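The done()/result() calls on the returned pointer are not pyiron-specific; they are the standard concurrent.futures.Future protocol, which can be tried without flux or pyiron:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def slow_add(a, b):
    # stand-in for a long-running calculation
    time.sleep(0.2)
    return a + b

with ThreadPoolExecutor(max_workers=1) as executor:
    fs = executor.submit(slow_add, 1, 2)  # returns a Future immediately
    print(fs.done())      # almost certainly False: the call is still running
    result = fs.result()  # blocks until the call finishes
    print(result)         # 3
    print(fs.done())      # True
```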

@samwaseda
Member

As illustrated in https://github.com/pyiron-dev/pyiron-flux-demo, the first step is to call flux start; this returns a shell, and then in this shell the Python process has to be executed. For a Jupyter notebook you can use flux start jupyter notebook. In addition, just like for any other queuing system, it is necessary to modify the run_*.sh scripts for the individual codes to use flux run, just like they would use srun for SLURM.

I have serious concerns about the utility of flux based on this example, because I don't see straightforwardly how to relate other pyiron jobs to this example. I'm sure that this is the right way, but I just don't think it's ripe for pyiron_base. Would it be imaginable to move it to pyiron_contrib? If not, I think this demo has to be extended to include steps to make other programs available using flux in a pyironic way (i.e. it shouldn't be a string inside a notebook that's exported and executed).

Maybe @pmrv or @liamhuber has an opinion?

@niklassiemer
Member

Indeed, the returned Future upon run is 'only' for status checking, isn't it? One concern is also that a 'standard' pyiron user forgets to assign the return value of run(), since we never did this... in that case it starts running but one could not get the Future object any more. Thus, I would like to store the future somewhere...
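The difference between returning the future and storing it can be shown with a minimal mock; Server and Job here are simplified stand-ins, not the real pyiron classes, and job.server.future is the proposed (not yet merged) attribute:

```python
from concurrent.futures import ThreadPoolExecutor

class Server:
    # stand-in for pyiron's job.server; `future` is the proposed attribute
    def __init__(self):
        self.executor = None
        self.future = None

class Job:
    def __init__(self):
        self.server = Server()

    def run(self):
        # storing the future on the server keeps it reachable even when
        # the return value of run() is discarded by the user
        self.server.future = self.server.executor.submit(lambda: 42)
        return self.server.future

job = Job()
job.server.executor = ThreadPoolExecutor(max_workers=1)
job.run()  # return value ignored, as a 'standard' pyiron user would
result = job.server.future.result()  # still accessible afterwards
job.server.executor.shutdown()
print(result)  # 42
```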

@jan-janssen
Member Author

it shouldn't be a string inside a notebook that's exported and executed.

If you start your notebook using flux start jupyter notebook then you can use all the code in the string directly in the Jupyter notebook. The issue here is just on the mybinder side, because in mybinder I cannot manipulate the way the Jupyter notebook is started.

@jan-janssen
Member Author

Indeed, the returned Future upon run is 'only' for status checking, isn't it? One concern is also that a 'standard' pyiron user forgets to assign the return value of run(), since we never did this... in that case it starts running but one could not get the Future object any more. Thus, I would like to store the future somewhere...

As suggested above:

An alternative suggestion would be to attach the future object to job.server.future, then the run function does not return any output, if that is more desirable.

@jan-janssen
Member Author

Would it be imaginable to move it to pyiron_contrib?

I thought about this for a long time, but on the one hand this would result in a lot of duplicate code to have LAMMPS and VASP run with the new interface, as we would have to have modified job classes and so on. We had similar discussions about a pyiron interface without files in https://github.com/pyiron/pyiron_contrib/tree/main/pyiron_contrib/nofiles which ultimately resulted in the release of https://github.com/pyiron/pyiron_lammps . While in this case having a standalone code really helped us to accelerate the development, in particular for large numbers of small LAMMPS simulations where pyiron struggles. The flux integration is primarily designed for the up-scaling of DFT calculations and orchestrating large numbers of DFT calculations in one allocation. This directly addresses the primary application of pyiron_atomistics, so from my perspective the flux code has to be integrated in pyiron_base.

@jan-janssen
Member Author

To simplify the discussion, I created a separate pull request for the ProcessPoolExecutor interface:
#1155

This pull request is based on ProcessPoolExecutor interface and adds the flux interface.

@samwaseda
Member

If you start your notebook using flux start jupyter notebook then you can use all the code in the string directly in the Jupyter notebook. The issue here is just on the mybinder side, because in mybinder I cannot manipulate the way the Jupyter notebook is started.

This sounds like "this is a demo that people shouldn't follow", which is a contradiction in itself... For me it doesn't have to be quite as interactive as binder, but we need something more enlightening that people can straightforwardly follow.

@samwaseda
Member

I thought about this for a long time, but on the one hand this would result in a lot of duplicate code to have LAMMPS and VASP run with the new interface, as we would have to have modified job classes and so on. We had similar discussions about a pyiron interface without files in https://github.com/pyiron/pyiron_contrib/tree/main/pyiron_contrib/nofiles which ultimately resulted in the release of https://github.com/pyiron/pyiron_lammps . While in this case having a standalone code really helped us to accelerate the development, in particular for large numbers of small LAMMPS simulations where pyiron struggles.

If you are saying it should be included in pyiron_base because it's otherwise not efficient, I have to say it's not really a reason... Btw. I guess the last phrase requires an auxiliary phrase that might have been crucial.

The flux integration is primarily designed for the up-scaling of DFT calculations and orchestrating large numbers of DFT calculations in one allocation. This directly addresses the primary application of pyiron_atomistics, so from my perspective the flux code has to be integrated in pyiron_base.

Again, I understand what it was made for, and I agree that it's the future (in the literal sense), but the question is how it should be implemented.

As suggested above:

An alternative suggestion would be to attach the future object to job.server.future, then the run function does not return any output, if that is more desirable.

This statement shows for me that the discussion has not converged yet.

Just to make sure, this PR has two major problems for me:

  • It breaks the generality of run
  • It is not clear how to make other jobs flux-compatible within the pyiron framework

For both points it's difficult for me to understand why it's happening this way and to estimate what the steps would be to resolve the problems, but I have the feeling that we need a discussion on a more fundamental level. At least I'm sure we cannot maintain the code reliably in the current format.

@jan-janssen
Member Author

jan-janssen commented Jul 4, 2023

  • It breaks the generality of run

As suggested already above, I am fine with attaching the futures object to job.server.future if that resolves your concerns.

  • It is not clear how to make other jobs flux-compatible within the pyiron framework

I guess this is a general misunderstanding. Each job is flux-compatible, as flux can be used just like all the other queuing systems via pysqa. #1155 implements a new asynchronous job interface, based on the concurrent.futures.ProcessPoolExecutor class. This pull request extends #1155 by adding an asynchronous flux interface. So flux is basically the first queuing system which supports the asynchronous job interface as well as the traditional interface based on pysqa.

@jan-janssen
Member Author

I updated the flux mybinder example https://github.com/pyiron-dev/pyiron-flux-demo/tree/main - it now starts jupyter with flux start jupyter notebook. This is done by manipulating the kernel.json file.
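The kernel.json manipulation can be sketched as below; the spec content is a typical default IPython kernel spec assumed for illustration, and the actual file location depends on the installed kernel (neither is taken from the demo repository):

```python
import json

# a typical default IPython kernel spec (assumed for illustration)
kernel_spec = {
    "argv": ["python", "-m", "ipykernel_launcher", "-f", "{connection_file}"],
    "display_name": "Python 3",
    "language": "python",
}

# prepend "flux start" so Jupyter boots the kernel inside a Flux instance,
# mirroring `flux start jupyter notebook` on the command line
kernel_spec["argv"] = ["flux", "start"] + kernel_spec["argv"]
print(json.dumps(kernel_spec, indent=2))
```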

@liamhuber
Member

liamhuber commented Jul 4, 2023

I'm not going to be able to get to reviewing this PR today. Overall I thought #1155 was pretty good, although I may want to play around with some different combinations (like interactively opening a job with a flux executor set) to see how it goes. The baseline behaviour is OK by me, though.

  • It breaks the generality of run

As suggested already above, I am fine with attaching the futures object to job.server.future if that resolves your concerns.

This also comes up in #1155. I am ok with this solution; it may not be the final resting place of the futures object, but it will do for now and doesn't clutter up the main job namespace.

  • It is not clear how to make other jobs flux-compatible within the pyiron framework

I guess this is a general misunderstanding. Each job is flux-compatible, as flux can be used just like all the other queuing systems via pysqa. #1155 implements a new asynchronous job interface, based on the concurrent.futures.ProcessPoolExecutor class. This pull request extends #1155 by adding an asynchronous flux interface. So flux is basically the first queuing system which supports the asynchronous job interface as well as the traditional interface based on pysqa.

I think @samwaseda means master jobs? In which case #1155 certainly does not provide a universal interface and if this PR is similar his concern is totally valid!

I do think that at this stage it's fair enough to fail cleanly rather than support everything. Although obviously I would be much more comfortable with a "fail clean instead of working" approach in contrib than here. (I saw your earlier comment on the topic of moving to contrib, and I'm not currently of the opinion that we should force the flux stuff over there, but that still doesn't mean I'm happy about this rough-around-the-edges stuff being in base...)

@jan-janssen, take a look at my comments in #1155. I'll come review this one tomorrow, but if it's similar to that then I guess I will be able to get behind the gist of it. I do share Sam's other concern:

This sounds like "this is a demo that people shouldn't follow", which is a contradiction in itself... For me it doesn't have to be quite as interactive as binder, but we need something more enlightening that people can straightforwardly follow.

The binder on https://github.com/pyiron-dev/pyiron-flux-demo full on fails, so what is even the point? But anyhow, while a somehow-functional demo is an important thing to have before merging this, this seems like a simple technical problem that I trust we'll find a way around (EDIT: Super, the binder issue got resolved while I was writing this. The new version of the binder link is working fine for me!). What's more core is coming to consensus about the architecture of where to put the futures and how to handle master jobs.

@samwaseda
Member

I guess this is a general misunderstanding. Each job is flux-compatible, as flux can be used just like all the other queuing systems via pysqa. #1155 implements a new asynchronous job interface, based on the concurrent.futures.ProcessPoolExecutor class. This pull request extends #1155 by adding an asynchronous flux interface. So flux is basically the first queuing system which supports the asynchronous job interface as well as the traditional interface based on pysqa.

OK, then I'm now a lot more comfortable, but with the demo before you updated it, it was not very clear how to make that happen. I'm finished for today, but I promise that I'll take a look as soon as I come back tomorrow morning.

@liamhuber liamhuber marked this pull request as draft July 5, 2023 16:32
@liamhuber
Member

This is built on top of #1155 and keeps merging those changes into it, but targets main; let's make it a draft until #1155 is merged.

@samwaseda
Member

One thing I don't understand is the coexistence of fs.done() and job status. Is there a reason that they cannot be merged except for historical reasons?

@jan-janssen
Member Author

One thing I don't understand is the coexistence of fs.done() and job status. Is there a reason that they cannot be merged except for historical reasons?

I agree; if I were to start a new version of pyiron_base I would only use concurrent.futures and remove the job status completely, but this change would be much more drastic, so for the moment we are going to have both concepts until we have more experience with concurrent.futures.
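As a sketch of how the two concepts could eventually converge, a hypothetical helper (not part of pyiron_base, and the status names are illustrative) could derive a pyiron-style status string directly from the Future:

```python
from concurrent.futures import Future

def status_from_future(fs):
    # hypothetical mapping, not implemented in pyiron_base
    if fs is None:
        return "initialized"   # no executor-based run started yet
    if fs.running():
        return "running"
    if fs.done():
        return "finished"
    return "submitted"         # queued but not yet started

fs = Future()                  # a bare Future: queued, not yet running
print(status_from_future(fs))  # submitted
fs.set_result(0)
print(status_from_future(fs))  # finished
```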

@samwaseda
Member

I agree; if I were to start a new version of pyiron_base I would only use concurrent.futures and remove the job status completely, but this change would be much more drastic, so for the moment we are going to have both concepts until we have more experience with concurrent.futures.

but isn’t this then the reason why it should be in contrib and not base? We are breaking the concept with this change

@jan-janssen
Member Author

but isn’t this then the reason why it should be in contrib and not base? We are breaking the concept with this change

I agree that this implementation adds an alternative to the previous implementation, and I am convinced that if I were to rewrite pyiron_base from scratch I would only use concurrent.futures. Still, I also think we both agree that this is not what we want to do right now, as we have so many other developments in the pipeline regarding the workflows/graphs and so on.

So the current pull requests introduce the functionality in a way that allows existing users who leverage pyiron for DFT calculations to already benefit from the new functionality, without interfering with existing workflows. As discussed in #1155 I want to add the integration with wait_for_jobs and remove_jobs, so the only new command a user has to learn is setting job.server.executor, while all the rest behaves exactly the same. A user can access the internal job.server.future interface but they do not have to.

@niklassiemer
Member

but isn’t this then the reason why it should be in contrib and not base? We are breaking the concept with this change

I agree that this implementation adds an alternative to the previous implementation, and I am convinced that if I were to rewrite pyiron_base from scratch I would only use concurrent.futures. Still, I also think we both agree that this is not what we want to do right now, as we have so many other developments in the pipeline regarding the workflows/graphs and so on.

So the current pull requests introduce the functionality in a way that allows existing users who leverage pyiron for DFT calculations to already benefit from the new functionality, without interfering with existing workflows. As discussed in #1155 I want to add the integration with wait_for_jobs and remove_jobs, so the only new command a user has to learn is setting job.server.executor, while all the rest behaves exactly the same. A user can access the internal job.server.future interface but they do not have to.

I like the general idea and I am fine with having this in pyiron_base if we agree on the interface, if it does not affect current users, if it is integrated into the pyiron syntax as well as possible, and if it raises clear error messages if something is not supported. I deem we are on a good way to achieve these goals :) Maybe the job status should be able to react to a present Future object and ask it instead of the file or the database if there is a Future? This would streamline the user experience.

@jan-janssen (from a future perspective) why does the user need to set the executor? Could this not be handled by pyiron upon using a flux-based queue type? Right now, it is of course fine that users have to do this to enable the new run functionality!

@samwaseda
Member

Now I thought about it again and think that this should also change status. Essentially, at least for the time being, it should work in the conventional way, meaning status=suspended should be there, and refresh_blablabla should update the statuses.

@liamhuber
Member

liamhuber commented Jul 7, 2023

@jan-janssen it's a bit hard to say until the diff gets updated, but it looks like once #1155 gets merged the changes here will be extremely minimal and good-to-go.

It would be cool to have the infrastructure in our CI for OS-dependent envs (I can imagine an implementation of this and it should be possible); then we could write linux-only tests for the flux stuff. But that's for the future, and IMO now that #1155 has a bunch of tests we can get away without any extensive testing for flux here (for the foreseeable future at least).

@jan-janssen jan-janssen marked this pull request as ready for review July 7, 2023 16:36
@jan-janssen
Member Author

@jan-janssen it's a bit hard to say until the diff gets updated, but it looks like once #1155 gets merged the changes here will be extremely minimal and good-to-go.

I just merged the changes into main and main into this pull request. It is basically just a single function that is added, plus jinja2 as an additional dependency, but we already use jinja2 in pysqa so it was already an implicit dependency before.

Member

@liamhuber liamhuber left a comment


lgtm!

I guess some of the stuff about how to boot your Jupyter notebook such that it fluxes might be nice to document, but also that's basically just flux documentation and not pyiron-flux documentation, so I'm also OK with leaving it out. Now that this is on top of #1155 it's nice clean changes, so in addition to the working binder demo I'm totally happy with it.

@jan-janssen
Member Author

@samwaseda Do you want to take another look? I would be very happy to merge this before Monday.

Member

@niklassiemer niklassiemer left a comment


I think with the change to store the Future in job.server.future the example code needs an edit. The code changes LGTM.

Member

@samwaseda samwaseda left a comment


Yeah looks good to me now.

@samwaseda samwaseda merged commit 74b82d1 into main Jul 8, 2023
23 checks passed
@delete-merged-branch delete-merged-branch bot deleted the flux branch July 8, 2023 07:54