Reimplement the flux interface #1154
Conversation
The demo is also available at https://github.com/pyiron-dev/pyiron-flux-demo
Missing parts:
- DocStrings
- Python
With the …
Done
I added Docstrings, can you take another look?
I really like the introductory text in the function. Is there a possibility to extend the example a bit more so that the code becomes complete? In particular, it is not very clear right now where Executor() comes from. I would really start from import flux.job and put job.server.executor = flux.job.FluxExecutor().
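As a runnable illustration of the pattern being suggested: flux.job.FluxExecutor follows the standard concurrent.futures.Executor interface, so the sketch below uses ThreadPoolExecutor as a stand-in (the calculate function is a placeholder for a pyiron job, not actual pyiron API):

```python
# Sketch of the Executor/Future pattern that flux.job.FluxExecutor
# implements; concurrent.futures.ThreadPoolExecutor is used here as a
# runnable stand-in since both follow the same Executor interface.
from concurrent.futures import ThreadPoolExecutor

def calculate(x):
    # placeholder for a pyiron job / DFT calculation
    return x * x

with ThreadPoolExecutor(max_workers=2) as executor:
    future = executor.submit(calculate, 3)  # returns immediately
    print(future.done())                    # may still be running here
    print(future.result())                  # blocks until finished: 9
```

The key point for the docstring is that submit() hands back a Future right away while the calculation runs in the background.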
Apart from the docs, is there actually a practical example which shows the capability of flux? I'm asking because I'm not 100% convinced that we should introduce a new way of accessing job data: so far in pyiron we have always worked with only job, and I'm wondering whether returning another object from job.run() is really necessary.
How can I actually unblock merging?
I guess you can just create a new review and approve the merge request; that should unblock the merging.
I added the import statement to clarify the usage.
To develop exascale workflows with pyiron - which is an essential part of my PostDoc here in Los Alamos - we need asynchronous workflows. Fitting a machine learning potential is one example we are currently working on. Basically, we start thousands of DFT calculations, one per GPU, and once 100 DFT calculations are finished we start fitting the first potential. We then use the potential for active learning and for generating new structures we want to include in the training set. During this initial fitting, 90% of the DFT calculations are still running. Once more DFT calculations are completed, we fit another potential and update the active learning to use the new potential. The whole time, the cyclic workflow of 1) DFT calculation, 2) fitting and 3) active learning is executed asynchronously and in parallel. So at any point in time some DFT calculations are running, a new potential is being fitted to the already finished DFT calculations, and a previous potential is being used in an active learning procedure to identify new structures. The standard interface in Python for developing such asynchronous procedures is the …
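The overlap described above - refitting as soon as a batch of calculations finishes while the rest keep running - can be sketched with the standard library's concurrent.futures. The run_dft and fit_potential functions below are placeholders, not pyiron API:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_dft(structure):
    # placeholder for a DFT calculation on one structure
    return {"structure": structure, "energy": -float(structure)}

def fit_potential(results):
    # placeholder for fitting a machine learning potential;
    # returns the number of results it was trained on
    return len(results)

with ThreadPoolExecutor(max_workers=4) as executor:
    futures = [executor.submit(run_dft, s) for s in range(10)]
    finished = []
    potential = None
    for future in as_completed(futures):
        finished.append(future.result())
        # refit as soon as another batch of 5 calculations has
        # completed, while the remaining ones keep running
        if len(finished) % 5 == 0:
            potential = fit_potential(finished)

print(potential)  # last potential saw all 10 finished calculations: 10
```

as_completed() yields each Future the moment it finishes, which is what makes the cyclic DFT/fitting/active-learning loop asynchronous.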
Actually I cannot find …
Yes, it is …
I got this error message: …
As illustrated in https://github.com/pyiron-dev/pyiron-flux-demo the first step is to call …
I know what it should do, but the question is how it does it. So far I haven't been able to test it myself, but while in pyiron there were different types of run modes, there has never been anything more than just … This being said, since I haven't been able to run it myself so far, there might be a fundamental misunderstanding on my side, so feel free to correct me if you think that's the case.
Yes, asynchronous workflows are different from the synchronous workflows we developed before. Previously, each step in a workflow had a clear start and a clear end, and only after one step ended did a new step start. In asynchronous workflows the individual steps overlap. To address your concerns:
An alternative suggestion would be to attach the future object to …
I have serious concerns about the utility of … Maybe @pmrv or @liamhuber has an opinion?
Indeed, the Future returned upon run is 'only' for status checking, isn't it? One concern is also that a 'standard' pyiron user forgets to assign the return value of run(), since we never did this... in that case the job starts running, but one could not get the Future object any more. Thus, I would like to store the future somewhere...
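The concern about an unassigned Future can be illustrated with the standard library alone (a sketch; ThreadPoolExecutor stands in for the flux executor, and submit() stands in for run()):

```python
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=1)

# If the Future returned by submit() is not assigned, the task still
# runs, but no handle is left to check its status or fetch the result:
# the situation described above for an unassigned run().
executor.submit(print, "runs anyway, but the Future is lost")

# Keeping the Future allows status checks and result retrieval:
future = executor.submit(sum, [1, 2, 3])
print(future.result())  # blocks until finished: 6
executor.shutdown()
```

Storing the Future on the job object (rather than only returning it) avoids losing the handle in the first case.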
If you start your notebook using …
As suggested above: …
I thought about this for a long time, but on the one hand this would result in a lot of duplicate code to have LAMMPS and VASP run with the new interface, as we would have to have modified job classes and so on. We had similar discussions about a pyiron interface without files in https://github.com/pyiron/pyiron_contrib/tree/main/pyiron_contrib/nofiles which ultimately resulted in the release of https://github.com/pyiron/pyiron_lammps . In that case, having a standalone code really helped us to accelerate the development, in particular for large numbers of small LAMMPS simulations, where pyiron struggles. The flux integration, in contrast, is primarily designed for the up-scaling of DFT calculations and for orchestrating large numbers of DFT calculations in one allocation. So this directly addresses the primary application of …
To simplify the discussion, I created a separate pull request for the … This pull request is based on …
This sounds like "this is a demo that people shouldn't follow", which is a contradiction in itself... For me it doesn't have to be quite as interactive as binder, but we need something more enlightening that people can straightforwardly follow.
If you are saying it should be included in …
Again, I understand what it was made for, and I agree that it's the future (in the literal sense), but the question is how it should be implemented.
This statement shows me that the discussion has not converged yet. Just to make sure, this PR has two major problems for me: …
For both points it's difficult for me to understand why it's happening this way and to estimate what the steps to resolve the problems would be, but I have the feeling that we need a discussion on a more fundamental level. At least I'm sure we cannot maintain the code reliably in the current format.
As already suggested above, I am fine with attaching the futures object to …
I guess this is a general misunderstanding. Each job is flux-compatible, as flux can be used just like all the other queuing systems via pysqa. #1155 implements a new asynchronous job interface, based on the …
I updated the flux mybinder example https://github.com/pyiron-dev/pyiron-flux-demo/tree/main - it now starts jupyter with …
I'm not going to be able to get to reviewing this PR today. Overall I thought #1155 was pretty good, although I may want to play around with some different combinations (like interactively opening a job with a flux executor set) to see how it goes. The baseline behaviour is OK by me, though.
This also comes up in #1155. I am ok with this solution; it may not be the final resting place of the futures object, but it will do for now and doesn't clutter up the main job namespace.
I think @samwaseda means master jobs? In which case #1155 certainly does not provide a universal interface, and if this PR is similar his concern is totally valid! I do think that at this stage it's fair enough to fail cleanly rather than support everything, although obviously I would be much more comfortable with a "fail clean instead of working" approach in contrib than here. (I saw your earlier comment on the topic of moving to contrib, and I'm not currently of the opinion that we should force the flux stuff over there, but that still doesn't mean I'm happy about this rough-around-the-edges stuff being in base...)
OK, then I'm now a lot more comfortable, but with the demo before you updated it, it was not very clear how to make that happen. I'm finishing for today, but I promise I'll take a look as soon as I'm back tomorrow morning.
One thing I don't understand is the coexistence of …
I agree; if I were to start a new version of …
But isn’t this then the reason why it should be in contrib and not base? We are breaking the concept with this change.
I agree that this implementation adds an alternative to the previous implementation, and I am convinced that if I were to rewrite … So the current pull requests introduce the functionality in a way that allows existing users who leverage pyiron for DFT calculations to benefit from the new functionality already, without interfering with existing users. As discussed in #1155, I want to add the integration with …
I like the general idea and I am fine with having this in … @jan-janssen (from a future perspective), why does the user need to set the executor? Could this not be handled by pyiron when using a flux-based queue type? Right now it is of course fine that users have to do this to enable the new run functionality!
Now that I have thought about it again, I think that this should also change …
@jan-janssen it's a bit hard to say until the diff gets updated, but it looks like once #1155 gets merged the changes here will be extremely minimal and good to go. It would be cool to have the infrastructure in our CI for OS-dependent envs -- I can imagine an implementation of this and it should be possible -- then we could write Linux-only tests for the flux stuff. But that's for the future, and IMO now that #1155 has a bunch of tests we can get away without …
I just merged the changes into main and main into this pull request. It is basically just a single function that is added, plus …
lgtm!
I guess some of the stuff about how to boot your jupyter notebook such that it fluxes might be nice to document, but also that's basically just flux documentation and not pyiron-flux documentation so I'm also ok with leaving it out. Now that this is on top of #1155 it's nice clean changes, so in addition to the working binder demo I'm totally happy with it.
@samwaseda Do you want to take another look? I would be very happy to merge this before Monday.
I think with the change to store the Future in job.server.future, the example code needs an edit. The code changes LGTM.
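With the Future stored on job.server.future, the usage pattern could look roughly like this. This is a runnable sketch with stand-in classes: ThreadPoolExecutor replaces flux.job.FluxExecutor, and the Job/Server class definitions are illustrative, not pyiron's actual implementation:

```python
from concurrent.futures import ThreadPoolExecutor

class Server:
    """Minimal stand-in for pyiron's job.server object (hypothetical)."""
    def __init__(self):
        self.executor = None
        self.future = None

class Job:
    """Minimal stand-in for a pyiron job (hypothetical)."""
    def __init__(self):
        self.server = Server()

    def run(self):
        # Store the Future on job.server.future instead of returning it,
        # so users who do not assign the return value of run() can still
        # reach it later.
        self.server.future = self.server.executor.submit(lambda: "finished")

job = Job()
job.server.executor = ThreadPoolExecutor(max_workers=1)
job.run()
print(job.server.future.result())  # prints "finished"
job.server.executor.shutdown()
```

Keeping the Future on job.server also keeps the main job namespace free, as noted earlier in the discussion.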
Yeah looks good to me now.
Example code: …
This pull request is based on:
#1151
#1152
#1153