Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[OpenBMC] rflash unattended quits early before activation is complete, BMC response issue? #4550

Closed
whowutwut opened this issue Dec 18, 2017 · 7 comments

Comments

@whowutwut
Copy link
Member

Using build:

[root@briggs01 910.1746.20171215a]# lsxcatd -v
Version 2.13.9 (git commit 113bab618f6f99aa6c17cabef54f2748d4bb11cf, built Fri Dec 15 07:16:01 EST 2017)

Trying to run this rflash unattended function...

[root@briggs01 910.1746.20171215a]# export XCAT_OPENBMC_DEVEL=YES
[root@briggs01 910.1746.20171215a]# rflash mid05tor12cn05 -l
mid05tor12cn05: ID       Purpose State      Version
mid05tor12cn05: -------------------------------------------------------
mid05tor12cn05: b3000fc6 BMC     Active(*)  ibm-v2.0-0-r13.3-0-gd624923
mid05tor12cn05: 608e9ebe BMC     Active     ibm-v2.0-0-r9-0-gecda565
mid05tor12cn05: 78abd177 Host    Active     IBM-witherspoon-ibm-OP9_v1.19_1.88
mid05tor12cn05: df4b8673 Host    Active(*)  IBM-witherspoon-ibm-OP9_v1.19_1.90
mid05tor12cn05:
[root@briggs01 910.1746.20171215a]# time rflash mid05tor12cn05 -d /mnt/xcat/iso/openbmc/910.1746.20171215a/
Attempting to upload /mnt/xcat/iso/openbmc/910.1746.20171215a/obmc-phosphor-image-witherspoon.ubi.mtd.tar and /mnt/xcat/iso/openbmc/910.1746.20171215a/witherspoon.pnor.squashfs.tar, please wait...
mid05tor12cn05: Firmware upload successful. Attempting to activate firmware: ibm-v2.0-0-r21-0-ge3cbb5a (ID: e3b2aba5)
mid05tor12cn05: Firmware upload successful. Attempting to activate firmware: IBM-witherspoon-ibm-OP9_v1.19_1.98 (ID: 20ec75fe)
mid05tor12cn05: Retry BMC state, wait for 15 seconds ...
mid05tor12cn05: BMC Ready
mid05tor12cn05: reset

real	3m56.285s
user	0m0.094s
sys	0m0.011s

So I didn't run with debug on, but the rflash command ended early, before the pieces went from Activation -> Active.

Let's see if we can figure out what triggered rflash to think that it was complete? In another window, I had a loop every 15 seconds to print the rflash <> -l output... I saw the following:

Mon Dec 18 11:08:23 EST 2017
mid05tor12cn05: ID       Purpose State      Version
mid05tor12cn05: -------------------------------------------------------
mid05tor12cn05: b3000fc6 BMC     Active(*)  ibm-v2.0-0-r13.3-0-gd624923
mid05tor12cn05: 608e9ebe BMC     Active     ibm-v2.0-0-r9-0-gecda565
mid05tor12cn05: e3b2aba5 BMC     Activating ibm-v2.0-0-r21-0-ge3cbb5a
mid05tor12cn05: df4b8673 Host    Active(*)  IBM-witherspoon-ibm-OP9_v1.19_1.90
mid05tor12cn05: 20ec75fe Host    Activating IBM-witherspoon-ibm-OP9_v1.19_1.98
mid05tor12cn05:
Mon Dec 18 11:08:29 EST 2017
======
Mon Dec 18 11:08:44 EST 2017
mid05tor12cn05: Error: [500] Login to BMC failed: 500 Can't connect to 172.11.139.5:443.
mid05tor12cn05: Error: BMC did not respond. Validate BMC configuration and retry the command.
Mon Dec 18 11:09:04 EST 2017
======
Mon Dec 18 11:09:19 EST 2017
mid05tor12cn05: ID       Purpose State      Version
mid05tor12cn05: -------------------------------------------------------
mid05tor12cn05: b3000fc6 BMC     Active(*)  ibm-v2.0-0-r13.3-0-gd624923
mid05tor12cn05: 608e9ebe BMC     Active     ibm-v2.0-0-r9-0-gecda565
mid05tor12cn05: e3b2aba5 BMC     Activating ibm-v2.0-0-r21-0-ge3cbb5a
mid05tor12cn05: df4b8673 Host    Active(*)  IBM-witherspoon-ibm-OP9_v1.19_1.90
mid05tor12cn05: 20ec75fe Host    Activating IBM-witherspoon-ibm-OP9_v1.19_1.98
mid05tor12cn05:
Mon Dec 18 11:09:25 EST 2017
======

But looking at commands.log , the rflash came back before we hit the 500 error.. Did this 500 error trigger xCAT to think that it was done monitoring the flash and go on?

mid05tor12cn05: Retry BMC state, wait for 15 seconds ...
mid05tor12cn05: BMC Ready
mid05tor12cn05: reset

Eventually it does complete activation but by then, xCAT is no longer monitoring this task to do the reboots..

====================================================
[Date]       2017-12-18 11:11:55
[ClientType] cli
[Request]    rflash mid05tor12cn05 -l
[Response]
mid05tor12cn05: ID       Purpose State      Version
mid05tor12cn05: -------------------------------------------------------
mid05tor12cn05: b3000fc6 BMC     Active(*)  ibm-v2.0-0-r13.3-0-gd624923
mid05tor12cn05: df4b8673 Host    Active(*)  IBM-witherspoon-ibm-OP9_v1.19_1.90
mid05tor12cn05: 608e9ebe BMC     Active     ibm-v2.0-0-r9-0-gecda565
mid05tor12cn05: e3b2aba5 BMC     Active(+)  ibm-v2.0-0-r21-0-ge3cbb5a
mid05tor12cn05: 20ec75fe Host    Active(+)  IBM-witherspoon-ibm-OP9_v1.19_1.98
mid05tor12cn05
[ElapsedTime] 3.727 s
@whowutwut
Copy link
Member Author

whowutwut commented Dec 18, 2017

I didn't capture the dump of the BMC before, and now updating BMC code cleared it. Looks like the earliest journalctl log did show a POWER off....

Dec 18 16:10:09 mid05tor12cn05-bmc system_manager.py[1246]: Running System State: HOST_POWERED_OFF
Dec 18 16:10:09 mid05tor12cn05-bmc systemd[1]: Started Disable VCS on a power off.
Dec 18 16:10:09 mid05tor12cn05-bmc systemd[1]: Starting Disable VCS on a power off...
Dec 18 16:10:09 mid05tor12cn05-bmc systemd[1]: Started Wait for Power0 to turn off.
Dec 18 16:10:08 mid05tor12cn05-bmc power_control.exe[1370]: PowerControl: pgood: 0, setting reset BMC_CP0_RESET_N to 0
Dec 18 16:10:07 mid05tor12cn05-bmc system_manager.py[1246]: Running System State: HOST_POWERING_OFF

So this is a little risky here, good thing we didn't reboot the BMC during Activation!

I am thinking this error 500 caused xCAT to go ahead with the processing, but what caused it, I don't think we will have the logs on the BMC. IF we have this kind of BMC instability, we may NOT want to support this unattended flash ...

@whowutwut
Copy link
Member Author

whowutwut commented Dec 18, 2017

@bybai Thinking about this some more, at a minimum we have to add additional checks so that before we reboot, we need to make sure the target firmware is "Active" to cover these kind of hiccups from the BMC. If not, then we just skip over those code pieces... Otherwise bad things could really happen if we reboot the wrong pieces in the middle of activation

@bybai
Copy link
Contributor

bybai commented Dec 19, 2017

@whowutwut, let me try to reproduce it in mid05tor12cn05.

@whowutwut
Copy link
Member Author

The firmware team has some commits that protect against reboot when activating. But that is not yet in the drivers we have

I think this is dangerous to run. (Especially reboot bmc when activating) We should protect against this from happening and then try to inject 500 return code to test out that we handle this ok. Don't reboot if the firmware is not in active state and fail out

@zet809 zet809 added the sprint1 label Dec 19, 2017
@zet809 zet809 added this to the 2.13.10 milestone Dec 19, 2017
@bybai
Copy link
Contributor

bybai commented Dec 19, 2017

Hi @whowutwut ,
I think you hit the issue when rflash checking BMC state, the machine status response is not OK, rflash enter "deal_with_response" function following line, that cause machine status enter "bmcreboot" :

2130         if (defined $status_info{RPOWER_BMC_STATUS_RESPONSE}{argv} and $status_info{RPOWER_BMC_STATUS_RESPONSE}{argv} =~ /bmc     state$/) {
2131             retry_check_times($node, "RPOWER_BMC_STATUS_REQUEST", "bmc_conn_check_times", $::BMC_CHECK_INTERVAL, $response->s     tatus_line);
2132             return;
2133         }

@bybai
Copy link
Contributor

bybai commented Dec 19, 2017

I try to reproduce the issue on mid05tor12cn05, Although I got the different result, I got firmware state active, but after bmcreboot, I got 500 and disconnect to BMC. Could your help look into the bmc of mid05tor12cn05;

From log, in my case, rflash enter different code logic from yours. But if the firmware is activating and BMC response give none 200 response, rflash will hit issue the same with your hit.

@bybai
Copy link
Contributor

bybai commented Dec 19, 2017

@whowutwut , I guess I hit CN05 bmc kernel panic you mentioned before. From log, BMC+PNOR are active, after BMC reboot, cannot connect BMC. The log is in commands.log

[Date]       2017-12-18 20:56:43
[ClientType] cli
[Request]    rflash mid05tor12cn05 -d '/mnt/xcat/iso/openbmc/910.1746.20171215a/' --verbose

I will use another CN to debug this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants