[OpenBMC] rflash unattended quits early before activation is complete, BMC response issue? #4550
Comments
I didn't capture the dump of the BMC before, and now updating BMC code cleared it. Looks like the earliest
So this is a little risky; good thing we didn't reboot the BMC during activation! I suspect this error 500 caused xCAT to go ahead with the processing, but as for what caused the 500, I don't think we will have the logs on the BMC. If we have this kind of BMC instability, we may NOT want to support this unattended flash ...
@bybai Thinking about this some more: at a minimum, we have to add checks so that, before we reboot, we make sure the target firmware is "Active", to cover these kinds of hiccups from the BMC. If it is not, we just skip over those code pieces. Otherwise bad things could really happen if we reboot the wrong pieces in the middle of activation.
@whowutwut, let me try to reproduce it on mid05tor12cn05.
The firmware team has some commits that protect against reboot while activating, but those are not yet in the drivers we have. I think this is dangerous to run (especially rebooting the BMC while activating). We should protect against this happening, and then try injecting a 500 return code to test that we handle it OK. Don't reboot if the firmware is not in the Active state; fail out instead.
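A minimal sketch of that guard, assuming the caller already has the HTTP status of the last BMC query and a map of each target image to its activation state. The function name `reboot_allowed`, the image keys, and the plain `"Active"` string are all hypothetical stand-ins, not xCAT or OpenBMC identifiers:

```python
# Hypothetical sketch: names and state strings are assumptions, not xCAT code.
ACTIVE = "Active"


def reboot_allowed(http_status, activation_states):
    """Return True only if the BMC answered 200 and every target image is Active.

    http_status       -- status code of the most recent BMC query
    activation_states -- dict mapping image name to its reported state
    """
    if http_status != 200:
        # A 500 (or any other non-200) says nothing about activation,
        # so it must never be read as "done".
        return False
    if not activation_states:
        # No state information at all: fail out rather than guess.
        return False
    return all(state == ACTIVE for state in activation_states.values())
```

Injecting a 500 in a test then amounts to calling the guard with `http_status=500` and checking that it refuses the reboot.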
Hi @whowutwut,
I tried to reproduce the issue on mid05tor12cn05 and got a different result: the firmware state was Active, but after bmcreboot I got a 500 and lost the connection to the BMC. Could you help look into the BMC of mid05tor12cn05? From the log, in my case rflash went through different code logic from yours. But if the firmware is activating and the BMC gives a non-200 response, rflash will hit the same issue you hit.
@whowutwut, I guess I hit the CN05 BMC kernel panic you mentioned before. From the log, BMC+PNOR are active; after the BMC reboot, I cannot connect to the BMC. The log is in commands.log
I will use another CN to debug this issue.
Using build:
Trying to run this rflash unattended function...
I didn't run with debug on, but the rflash command ended early, before the pieces went from Activating -> Active.
Let's see if we can figure out what triggered rflash to think that it was complete. In another window, I had a loop running every 15 seconds to print the `rflash <> -l` output. I saw the following:

But looking at `commands.log`, the rflash came back before we hit the 500 error. Did this 500 error trigger xCAT to think that it was done monitoring the flash and go on?

Eventually it does complete activation, but by then xCAT is no longer monitoring this task to do the reboots.
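For illustration only, here is a rough sketch of a monitoring loop that would not make that mistake: a non-200 response is counted as a transient error and polling continues, instead of the response being taken as a sign that the flash is finished. Everything here is an assumption rather than xCAT's actual logic: `fetch` is a hypothetical callable returning `(http_status, state)`, the state strings are simplified, and the 15-second default merely mirrors the manual loop above.

```python
import time


def wait_for_active(fetch, max_polls=60, interval=15.0, max_errors=5,
                    sleep=time.sleep):
    """Poll a hypothetical fetch() until the firmware reports Active.

    fetch() must return (http_status, state). Returns one of
    "active", "failed", "error" or "timeout".
    """
    errors = 0  # consecutive non-200 responses seen so far
    for _ in range(max_polls):
        status, state = fetch()
        if status != 200:
            # Transient BMC hiccup: keep monitoring, but give up if it persists.
            errors += 1
            if errors > max_errors:
                return "error"
        else:
            errors = 0
            if state == "Active":
                return "active"
            if state == "Failed":
                return "failed"
        sleep(interval)
    return "timeout"
```

Only an `"active"` result would justify moving on to the reboot step; every other outcome should stop the unattended flow and report the problem.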