Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix #6990: Improve NERSC login/error handling #7055

Merged
merged 63 commits into from
Jun 10, 2024
Merged

Conversation

e-carlin
Copy link
Member

@e-carlin e-carlin commented May 7, 2024

This fixes a variety of open issues related to NERSC login.

Fix #6106: Add a login to sbatch button to the simulation status panel. The login modal will still appear automatically when long is needed but in the event the user closes it the start simulation button is replaced with a login button until they login.

Fix #5411: Improved display of sbatch connection errors. Now the errors are displayed within the login panel itself. Not in the stateAsText line and with blue/red text above the start simulation button. We still need to handle better text for when NERSC is down (#7041)

Fix #4625: Same as #5411.

Fix #7054: Check the computeJobSerial when handling responses from simulationFrame and only update the gui if the current computeJobSerial is the same as the one that was current when the requests were sent.

Fix #4367: Also fixed by #7054.

There are multiple paths through the code to handle. They are hard to test in an automated fashion so I've described them below:
no data happy path:

  • no existing data
  • user selects vagrant cluster
  • show login button
  • click login button
  • enter creds
  • agent starts
  • show start simulation button
    no data wrong creds:
  • no existing data
  • user select vagrant cluster
  • show login button
  • enter bad creds
  • see message
  • retry
  • repeat until creds are good
  • agent starts
  • show start simulation button
    no data cannot connect:
  • no existing data
  • user selects vagrant
  • show login button
  • enter creds
  • Get message about unable to connect
  • they can try again or leave modal
  • if they close modal the login button is presented again
    no data close modal
  • no existing data
  • user selects vagrant
  • click login button
  • user closes modal
  • we stay on the page showing the login button
    existing simulation frames happy path
  • user has a simulation with frames
  • agent is dead
  • they go to the page with the report
  • modal appears requesting login creds
  • enter creds (handling bad creds / cannot connect)
  • panel refreshes displaying the data
    existing simulation frames close modal:
  • user has a simulation with existing frames
  • agent is dead
  • they go to the page with the report
  • modal appears requesting login creds
  • they close the modal (not entering creds)
  • panel has an alert telling them they need to log in
  • login button is present in panel
    existing simulation frames, close modal, click login button:
  • user has a simulation with existing frames
  • agent is dead
  • they go to the page with the report
  • modal appears requesting login creds
  • hey close the modal (not entering creds)
  • panel has an alert telling them they need to log in and the button to log in is present
  • they click the login button and log in
  • frames are retreived automatically
    gui already thinks user is logged in (showing start sim button instead of login button) but agent is dead (not logged in) happy path:
  • user clicks start simulation
  • modal to log in is presented
  • they enter their credentials
  • if they enter invalid creds then they are presented with a warning in the login modal and a chance to re-enter creds
  • repeat until creds are valid
  • new simulation is started for them automatically
    gui already thinks user is logged in (showing start sim button instead of login button) but agent is dead (not logged in) close modal path:
  • user clicks start simulation
  • modal to log in is presented
  • they close the modal w/o providing valid creds
  • simulationStatusPanel shows job state missing with no data
  • button to log in is present

This fixes a variety of open issues related to NERSC login.

Fix #6106: Add a login to sbatch button to the simulation status
panel. The login modal will still appear automatically when long is
needed but in the event the user closes it the start simulation button
is replaced with a login button until they login.

Fix #5411: Improved display of sbatch connection errors. Now the
errors are displayed within the login panel itself. Not in the
stateAsText line and with blue/red text above the start simulation
button. We still need to handle better text for when NERSC is down
(#7041)

Fix #4625: Same as #5411.

Fix #7054: Check the computeJobSerial when handling responses from
simulationFrame and only update the gui if the current
computeJobSerial is the same as the one that was current when the
requests were sent.

Fix #4367: Also fixed by #7054.

There are multiple paths through the code to handle. They are hard to test
in an automated fashion so I've described them below:
**no data happy path**:
- no existing data
- user selects vagrant cluster
- show login button
- click login button
- enter creds
- agent starts
- show start simulation button
**no data wrong creds**:
- no existing data
- user select vagrant cluster
- show login button
- enter bad creds
- see message
- retry
- repeat until creds are good
- agent starts
- show start simulation button
**no data cannot connect**:
- no existing data
- user selects vagrant
- show login button
- enter creds
- Get message about unable to connect
- they can try again or leave modal
- if they close modal the login button is presented again
**no data close modal**
- no existing data
- user selects vagrant
- click login button
- user closes modal
- we stay on the page showing the login button
**existing simulation frames happy path**
- user has a simulation with frames
- agent is dead
- they go to the page with the report
- modal appears requesting login creds
-  enter creds (handling bad creds / cannot connect)
- panel refreshes displaying the data
**existing simulation frames close modal**:
- user has a simulation with existing frames
- agent is dead
- they go to the page with the report
- modal appears requesting login creds
- they close the modal (not entering creds)
- panel has an alert telling them they need to log in
- login button is present in panel
**existing simulation frames, close modal, click login button**:
- user has a simulation with existing frames
- agent is dead
- they go to the page with the report
- modal appears requesting login creds
- hey close the modal (not entering creds)
- panel has an alert telling them they need to log in and the button to log in is present
- they click the login button and log in
- frames are retreived automatically
**gui already thinks user is logged in (showing start sim button instead of login button) but agent is dead (not logged in) happy path**:
- user clicks start simulation
- modal to log in is presented
- they enter their credentials
- if they enter invalid creds then they are presented with a warning in the login modal and a chance to re-enter creds
- repeat until creds are valid
- new simulation is started for them automatically
**gui already thinks user is logged in (showing start sim button instead of login button) but agent is dead (not logged in) close modal path**:
- user clicks start simulation
- modal to log in is presented
- they close the modal w/o providing valid creds
- simulationStatusPanel shows job state missing with no data
- button to log in is present
Copy link
Member

@robnagler robnagler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Read the comments about restartRunSimulation. I'm concerned that the state is denormalized and confusing. Consider rewriting as a single state machine.

sirepo/package_data/static/js/sirepo.js Outdated Show resolved Hide resolved
sirepo/package_data/static/js/sirepo.js Outdated Show resolved Hide resolved
sirepo/job_api.py Outdated Show resolved Hide resolved
sirepo/job_driver/sbatch.py Outdated Show resolved Hide resolved
sirepo/job_driver/sbatch.py Outdated Show resolved Hide resolved
sirepo/package_data/static/js/sirepo-components.js Outdated Show resolved Hide resolved
sirepo/package_data/static/js/sirepo-components.js Outdated Show resolved Hide resolved
sirepo/package_data/static/js/sirepo-components.js Outdated Show resolved Hide resolved
sirepo/package_data/static/js/sirepo-components.js Outdated Show resolved Hide resolved
}

$scope.$on('showSbatchLoginModal', function(event, broadcastArg) {
restartRunSimulation = broadcastArg.shouldRestartRunSimulation;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there would be better coupling if addRestartRunSimulation() (see below) looked at srExceptionParams (new name for broadcastArg) directly and the assignment wasn't here.

In looking at this code, there are too many booleans, and this should be a state machine handled central. restartRunSimulation is both global and a parameter passed around in the same controller. This is very confusing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I very begrudgingly introduced this bit of state. If you have a way to eliminate it please let me know. But, I can't think of a way to get rid of it. The problem is that the shouldRestartSimulation come through in the broadcast to showSbatchLoginModal. Somehow we need to communicate shouldRestartRunSimulation to el.on('hidden.bs.modal'...). The restartRunSimulation var was the only way I could figure out how to do this.

sirepo/job_supervisor.py Show resolved Hide resolved
@e-carlin
Copy link
Member Author

e-carlin commented May 9, 2024

After discussion with Rob and Paul I'm going to create a service that holds all of the state about sbatch login.

@e-carlin
Copy link
Member Author

Not yet a state machine but this pulls everything into a service f0c507b. This drastically cleaned up the code. Thanks for the suggestion @robnagler.

I think it is a good pattern to use across Sirepo going forward. As you said, we denormalize a lot of state in $scope variables across directives. Pulling into a service cuts way down on that which makes the directive controllers much simpler.

I was worried that the angular digest cycle wasn't going to pick up on changes to the state within the service. But, it had no problem (even removed an apply I had to use in the previous implementation).

I'll convert to a state machine next week.

@robnagler
Copy link
Member

  1. Changing job run mode after login results in an error: Change job run mode to sbatch, login, change back to smp, see error

1c548b7

  1. Changing job run modes while a request for status is happening causes an error when it comes back: Add await asyncio.sleep(10) to api_sbatchLoginStatus to simulate a slow request, change job run mode to vagrant cluster (request for status starts), change job run mode back to SMP, see error. This may just be another way to trigger the first error.

a95c12c

  1. Always prompted to log in: Discard change to example, change job run mode to sbatch, login (agent is started), refresh page (job run mode should go back to the default smp), change job run mode to vagrant, forced to log in again even though I have an agent

c72afa2

  1. Race condition between frame and login status: set await asyncio.sleep(10) in api_sbatchLoginStatus, have a simulation with a frame and job run mode set to sbatch, agent dead, refresh page, frame is requested and login status is requested, frame SRException pops up login modal (I think?), do login (quickly), wait, login status comes back, see error.

8f8bd97

  1. Frame is not requested after login: Have a simulation with a frame and job run mode sbatch, agent dead, refresh browser tab, modal appears to log in, log in, wait, frame is never shown.

007fe61

  1. Slow frame and starting job causes report not generated: put await asyncio.sleep(10) in api_simulationFrame, have job run mode set to sbatch, have a frame, have an alive agent, refresh browser tab so frame is requested, while waiting for it to come back click start new simulation, see error once it comes back.

#7099

  1. Radia changing run mode causes error: Radia > accelerator magnets > basic dipole, go to visualization tab, change job run mode to sbatch, see error

3ce486b

  1. This visualization tab in opal has error: opal > examples > awa photo injector, go to visualization tab, see error
  1. On any app clicking on the hamburger menu "simulations" button that takes one back to the list of simulations results in an error

6ad697a

This isn't related to this PR but noticed it while testing: set job run mode sbatch, start a simulation, kill agent (pkill -f job_agent), gui never reports that the simulation has stopped.

#7100 - does show up eventually (took two minutes in my case). the polling is once a minute so it could take a minute. not sure why it took two minutes

Also, while testing I noticed that Radia visualization tab has 2 simulation status panels and sbatc

This will work now, but it will be a little goofy if user modifies jobRunMode in one panel while the other is requesting status. Switching the mode back and forth would fix things. The proper solution requires state machine instances (doable since it's one value) to be bound to models, since that actually sits on the model. This would be a bit of work, but lets wait for a user complaint.

@robnagler
Copy link
Member

@e-carlin ready for another review.

Copy link
Member Author

@e-carlin e-carlin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One nit. And found one bug:

Frame request loop: Have an sbatch simulation with a frame, agent dead, refresh page with frame, modal to login pops up, hit cancel, see repeated requests for frame and flashes of "requesting data"

sirepo/job_supervisor.py Show resolved Hide resolved
@robnagler
Copy link
Member

3dab62a adds resetSimulation and removed the requestFunction polling hack. @e-carlin Thanks for pointing out resetSimulation. That was the missing piece.

@e-carlin
Copy link
Member Author

Since I opened the pr I can't approve it but lgtm

@robnagler robnagler merged commit e792bb0 into master Jun 10, 2024
2 checks passed
@robnagler robnagler deleted the 6990-sbatch-login branch June 10, 2024 20:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants