
UI thinks a node is still deploying, even after the job has disappeared. #2443

Closed
slaperche-scality opened this issue Apr 21, 2020 · 1 comment
Assignees: ChengYanJin
Labels: complexity:medium (Something that requires one or few days to fix), kind:bug (Something isn't working), topic:ui (UI-related issues)


slaperche-scality commented Apr 21, 2020

Component:

UI

What happened:

Due to some issue with the OneClick, the SSH key was unusable.
Thus, when I tried to expand the cluster, the deployment got stuck (because Salt was prompting for a yes/no confirmation for the SSH key during the ping, I guess).
After manually fixing the SSH key, I ran the expansion from the CLI and it worked.

But the UI still believes that the node is deploying (even though its actual status is Ready).

[Screenshot: MetalK8s Platform UI (2020-04-21) showing the node as Deploying]

After investigating with @ChengYanJin, it appears that the UI is still polling the Salt JID of the stuck job and thinks it hasn't completed yet:

[Screenshot: the polled Salt job]

But this job no longer exists:

[root@hd-cluster-bootstrap /]# salt-run jobs.print_job 20200421071144947483
20200421071144947483:
    ----------
    Error:
        Cannot contact returner or no job with this jid
    Result:
        ----------
    StartTime:
        2020, Apr 21 07:11:44.947483
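
For reference, the UI presumably reaches the same runner through salt-api; a hypothetical sketch of that polling request (the endpoint, port, and token handling are assumptions, not taken from the MetalK8s code):

// Hypothetical sketch: poll a Salt job by JID through salt-api.
const saltToken = '<token obtained from the salt-api login endpoint>';

fetch('https://bootstrap:4507/', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'X-Auth-Token': saltToken,
  },
  body: JSON.stringify([
    { client: 'runner', fun: 'jobs.print_job', arg: ['20200421071144947483'] },
  ]),
})
  .then((response) => response.json())
  .then((result) => console.log(result.return[0]));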

And indeed, the UI gets the same answer:

[Screenshot: the Salt API response received by the UI]
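
Since the screenshot is lost, here is what that response presumably looks like, reconstructed from the salt-run output above (the exact JSON shape is an assumption based on how getJobStatusFromPrintJob indexes it):

{
  "return": [
    {
      "20200421071144947483": {
        "Error": "Cannot contact returner or no job with this jid",
        "Result": {},
        "StartTime": "2020, Apr 21 07:11:44.947483"
      }
    }
  ]
}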

But it isn't processing this answer correctly.
I think the bug lies either in refreshJobStatus (with its TODO: error handling? 😂) or in getJobStatusFromPrintJob, which doesn't look for an Error key in the response.

Note that we should handle this "JID not found" case anyway, because it can also happen if the salt-master crashes or is restarted (JIDs are not persistent).

What was expected:

The node shouldn't appear as Deploying…

When the JID no longer exists, we should at least warn the user that something went wrong (and maybe restart the job automatically, though I'm not sure about this behavior…)

Steps to reproduce:

I guess you can try:

  • use a "bad" or unknown SSH key when doing cluster expansion from the UI
  • fix the SSH key
  • run the deployment from the CLI
  • go check the UI

Resolution proposal:
We should handle the case where an error occurs (reported not in the job's Result, but in the Error key of the JSON): set the job's completed flag to true and pop up a notification with the error message.
Otherwise the job stays stuck there and will never finish.

The current implementation only looks at Result:

export function getJobStatusFromPrintJob(result, jid) {
  let status = {
    completed: false,
  };
  const job = result.return[0][jid];
  // Only a populated `Result` marks the job as completed; a response
  // carrying an `Error` key (e.g. "no job with this jid") has an empty
  // `Result`, so the job stays "not completed" forever.
  if (job && Object.keys(job['Result']).length) {
    status.completed = true;
    const returner = Object.values(job['Result'])[0].return;
    status.success = returner.success;
    if (!status.success) {
      status = { ...status, ...parseJobError(returner) };
    }
  }
  return status;
}
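
A minimal sketch of the proposed handling, assuming the notification layer can consume a plain message field (the message field name is an assumption, to be aligned with whatever parseJobError returns):

export function getJobStatusFromPrintJob(result, jid) {
  let status = {
    completed: false,
  };
  const job = result.return[0][jid];
  if (job) {
    if (job['Error']) {
      // JID unknown to Salt (e.g. restarted master, expired job cache):
      // consider the job finished and failed, and carry the error message
      // along so the UI can notify the user instead of polling forever.
      status.completed = true;
      status.success = false;
      status.message = job['Error']; // field name is an assumption
    } else if (Object.keys(job['Result']).length) {
      status.completed = true;
      const returner = Object.values(job['Result'])[0].return;
      status.success = returner.success;
      if (!status.success) {
        status = { ...status, ...parseJobError(returner) };
      }
    }
  }
  return status;
}

refreshJobStatus would then see completed: true, treat the failure like any other failed job, and could raise the notification instead of re-polling forever.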
@slaperche-scality slaperche-scality added the topic:ui UI-related issues label Apr 21, 2020
@ChengYanJin ChengYanJin self-assigned this Apr 22, 2020
@thomasdanan thomasdanan added the kind:bug Something isn't working label Apr 22, 2020
ChengYanJin added commits that referenced this issue Apr 24, 2020:

Since we are not sure how to reproduce the error circumstances,
we will just display everything inside the result.error

Refs: #2443

@ChengYanJin ChengYanJin added the complexity:medium Something that requires one or few days to fix label Apr 24, 2020
ChengYanJin commented:

Merged in #2458 (dev/2.6) and #2475 (dev/2.5).
