Master branch starting on build 353 BRCM DUT suffers SYNCD get switch_type failure resulting in Orchagent crash #5045

Closed
gechiang opened this issue Jul 27, 2020 · 2 comments · Fixed by #5052

Comments

@gechiang
Collaborator

Starting with master branch build 353, Orchagent crashes on BRCM DUTs.
The syslog shows the following:
...
Jul 26 22:32:45.980049 str-s6000-acs-8 NOTICE syncd#syncd: :- helperSaveDiscoveredObjectsToRedis: save discovered objects to redis took 2.731556 sec
Jul 26 22:32:45.984351 str-s6000-acs-8 WARNING syncd#syncd: :- helperGetSwitchAttrOid: failed to get SAI_SWITCH_ATTR_ECMP_HASH: SAI_STATUS_NOT_SUPPORTED
Jul 26 22:32:45.984351 str-s6000-acs-8 WARNING syncd#syncd: :- helperGetSwitchAttrOid: failed to get SAI_SWITCH_ATTR_LAG_HASH: SAI_STATUS_NOT_SUPPORTED
Jul 26 22:32:45.984351 str-s6000-acs-8 ERR syncd#syncd: :- getSwitchType: failed to get switch type
Jul 26 22:32:45.984351 str-s6000-acs-8 NOTICE syncd#syncd: :- SaiSwitch: constructor took 3.042757 sec
Jul 26 22:32:45.987625 str-s6000-acs-8 ERR syncd#syncd: :- run: Runtime error: :- getSwitchType: failed to get switch type
Jul 26 22:32:45.987625 str-s6000-acs-8 NOTICE syncd#syncd: :- sendShutdownRequest: sending switch_shutdown_request notification to OA for switch: oid:0x0
Jul 26 22:32:45.988511 str-s6000-acs-8 NOTICE syncd#syncd: :- sendShutdownRequestAfterException: notification send successfull
...
This leads to the Orchagent crash.
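
In a long syslog, the first failure can be found by filtering the ERR lines mechanically. Here is a minimal sketch, assuming the line layout seen in the excerpt above (timestamp, host, severity, process, then ":- function: message"). The names parse_line, errors_from and SAMPLE are illustrative and not part of any SONiC tool.

```python
import re
from typing import Dict, List, Optional

# Assumed layout, taken from the excerpt above:
# "<Mon DD HH:MM:SS.usec> <host> <LEVEL> <proc>: :- <func>: <msg>"
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+ [\d:.]+) (?P<host>\S+) (?P<level>[A-Z]+) "
    r"(?P<proc>\S+): :- (?P<func>\w+): (?P<msg>.*)$"
)


def parse_line(line: str) -> Optional[Dict[str, str]]:
    """Split one syslog line into named fields, or return None if it does not match."""
    match = LINE_RE.match(line.strip())
    return match.groupdict() if match else None


def errors_from(lines: List[str], proc: str = "syncd#syncd") -> List[Dict[str, str]]:
    """Return the parsed ERR records logged by the given process, in file order."""
    records = (parse_line(line) for line in lines)
    return [r for r in records if r and r["level"] == "ERR" and r["proc"] == proc]


# Lines copied from the excerpt above.
SAMPLE = [
    "Jul 26 22:32:45.984351 str-s6000-acs-8 WARNING syncd#syncd: :- helperGetSwitchAttrOid: failed to get SAI_SWITCH_ATTR_LAG_HASH: SAI_STATUS_NOT_SUPPORTED",
    "Jul 26 22:32:45.984351 str-s6000-acs-8 ERR syncd#syncd: :- getSwitchType: failed to get switch type",
    "Jul 26 22:32:45.984351 str-s6000-acs-8 NOTICE syncd#syncd: :- SaiSwitch: constructor took 3.042757 sec",
    "Jul 26 22:32:45.987625 str-s6000-acs-8 ERR syncd#syncd: :- run: Runtime error: :- getSwitchType: failed to get switch type",
]

errs = errors_from(SAMPLE)
# The first ERR record names the function that failed first.
print(errs[0]["func"], "->", errs[0]["msg"])
# getSwitchType -> failed to get switch type
```

On the excerpt, this points at getSwitchType as the first error. The later "run: Runtime error" line is the same failure being re-reported higher up.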

GuoHan suspected that this crash was introduced by:
sonic-sairedis: Add support to sonic-sairedis for gearbox phys (#632)
(sonic-net/sonic-sairedis#632)

I tried commenting out the code that was causing this issue in sai_switch_type_t SaiSwitch::getSwitchType(). There, the get switch_type failure on the BRCM DUT (return status -3, NO MEMORY) triggers SWSS_LOG_THROW("failed to get switch type"), which leads to the Orchagent crash. With the change, the function no longer throws and instead defaults to returning "SAI_SWITCH_TYPE_NPU". This resolved the Orchagent crash, but it exposed the next layer of issues: syncd now continuously logs errors like the following:

Jul 25 20:48:33.344512 str-s6000-acs-8 ERR syncd#syncd: :- guard: RedisReply catches system_error: command: *8#015#012$7#015#012EVALSHA#015#012$40#015#01231fc701ca9b1b9f968f501c92b639f50f6346a9c#015#012$1#015#0121#015#012$19#015#012oid:0x1000000000008#015#012$1#015#0122#015#012$8#015#012COUNTERS#015#012$7#015#0121000000#015#012$2#015#012''#15#012, reason: ERR Error running script (call to f_31fc701ca9b1b9f968f501c92b639f50f6346a9c): @user_script:21: user_script:21: attempt to perform arithmetic on local 'alpha' (a boolean value) : Input/output error
Jul 25 20:48:33.345213 str-s6000-acs-8 ERR syncd#syncd: :- runRedisScript: Caught exception while running Redis lua script: ERR Error running script (call to f_31fc701ca9b1b9f968f501c92b639f50f6346a9c): @user_script:21: user_script:21: attempt to perform arithmetic on local 'alpha' (a boolean value) : Input/output error
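
The command in the first log line is easier to read once the syslog escapes are undone. Here is a minimal sketch, assuming the "#015#012" sequences are rsyslog-style octal escapes for CR LF and that the command is a single Redis RESP array of bulk strings. The helper names are illustrative. The last argument is left exactly as logged, because its terminator appears as "#15" rather than "#015" in the report.

```python
import re
from typing import List


def decode_syslog_escapes(text: str) -> str:
    """Turn '#ooo' three-digit octal escapes (e.g. '#015#012') back into characters."""
    return re.sub(r"#([0-7]{3})", lambda m: chr(int(m.group(1), 8)), text)


def parse_resp_array(raw: str) -> List[str]:
    """Parse one RESP array of bulk strings into its list of arguments.

    Declared lengths are not validated; payloads are assumed not to contain CR LF.
    """
    lines = raw.split("\r\n")
    if not lines[0].startswith("*"):
        raise ValueError("not a RESP array")
    count = int(lines[0][1:])
    args: List[str] = []
    i = 1
    while len(args) < count and i + 1 < len(lines):
        # Each argument is a '$<length>' header line followed by the payload line.
        if not lines[i].startswith("$"):
            raise ValueError("expected a bulk string header, got %r" % lines[i])
        args.append(lines[i + 1])
        i += 2
    return args


# Command text copied from the first log line above.
ESCAPED = (
    "*8#015#012$7#015#012EVALSHA#015#012$40#015#012"
    "31fc701ca9b1b9f968f501c92b639f50f6346a9c#015#012$1#015#0121#015#012"
    "$19#015#012oid:0x1000000000008#015#012$1#015#0122#015#012"
    "$8#015#012COUNTERS#015#012$7#015#0121000000#015#012$2#015#012''#15#012"
)

args = parse_resp_array(decode_syslog_escapes(ESCAPED))
command, sha, numkeys = args[0], args[1], int(args[2])
keys = args[3:3 + numkeys]
script_args = args[3 + numkeys:]
print(command, keys, script_args[:3])
```

Read as EVALSHA <sha> <numkeys> <keys...> <args...>, the script was called with one key, oid:0x1000000000008, and with arguments beginning 2, COUNTERS, 1000000. Redis converts a nil reply to the boolean false inside Lua, so one possibility is that the script's 'alpha' came from a lookup that returned nothing. The report does not confirm this.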


BUG REPORT INFORMATION

Steps to reproduce the issue:

  1. Load any master branch build starting with image 353 on any BRCM DUT
  2. Once it boots up, Orchagent will have crashed, and the syslog will show the first error reported above.

Master.350 was the last good build that did not suffer from this issue. Unfortunately, master.351 and master.352 both had build issues, so we don't have their test results.

Output of show version:

admin@str-s6000-acs-8:~$ show version

SONiC Software Version: SONiC.master.353-b3ae7858
Distribution: Debian 10.4
Kernel: 4.19.0-9-2-amd64
Build commit: b3ae785
Build date: Sun Jul 19 13:27:43 UTC 2020
Built by: johnar@jenkins-worker-7

Platform: x86_64-dell_s6000_s1220-r0
HwSKU: Force10-S6000
ASIC: broadcom
Serial Number: 1QBRX42
Uptime: 07:05:06 up 38 min, 1 user, load average: 0.15, 0.10, 0.35

Docker images:
REPOSITORY                   TAG                  IMAGE ID      SIZE
docker-teamd                 latest               be843e740dd2  380MB
docker-teamd                 master.353-b3ae7858  be843e740dd2  380MB
docker-nat                   latest               d7faaa0c4d23  382MB
docker-nat                   master.353-b3ae7858  d7faaa0c4d23  382MB
docker-router-advertiser     latest               f5d8a5da5e15  350MB
docker-router-advertiser     master.353-b3ae7858  f5d8a5da5e15  350MB
docker-platform-monitor      latest               844697e3e64a  422MB
docker-platform-monitor      master.353-b3ae7858  844697e3e64a  422MB
docker-database              latest               f63204dd0f61  351MB
docker-database              master.353-b3ae7858  f63204dd0f61  351MB
docker-lldp                  latest               ebfc33f71809  377MB
docker-lldp                  master.353-b3ae7858  ebfc33f71809  377MB
docker-orchagent             latest               d8c14690e66c  393MB
docker-orchagent             master.353-b3ae7858  d8c14690e66c  393MB
docker-sonic-telemetry       latest               8764f9533b8d  414MB
docker-sonic-telemetry       master.353-b3ae7858  8764f9533b8d  414MB
docker-snmp                  latest               8605e02d8f8b  390MB
docker-snmp                  master.353-b3ae7858  8605e02d8f8b  390MB
docker-dhcp-relay            latest               870bed6def46  357MB
docker-dhcp-relay            master.353-b3ae7858  870bed6def46  357MB
docker-sonic-mgmt-framework  latest               a1cf99e915fe  473MB
docker-sonic-mgmt-framework  master.353-b3ae7858  a1cf99e915fe  473MB
docker-sflow                 latest               577f5c5eed53  383MB
docker-sflow                 master.353-b3ae7858  577f5c5eed53  383MB
docker-syncd-brcm            latest               21dd40c5d954  447MB
docker-syncd-brcm            master.353-b3ae7858  21dd40c5d954  447MB
docker-fpm-frr               latest               4e1bfd2f9ab7  340MB
docker-fpm-frr               master.353-b3ae7858  4e1bfd2f9ab7  340MB

get_switch_type_failure_casue_orchagent_crash_syslog.txt

@gechiang
Collaborator Author

Attaching the syslog captured after working around the get switch_type error, which then shows the RedisReply system errors:
get_switch_typeBypassed_RedisReply_srror_syslog.txt

@gechiang
Collaborator Author

Opened the issue in sonic-sairedis:
sonic-net/sonic-sairedis#646

@lguohan lguohan linked a pull request Jul 29, 2020 that will close this issue
@lguohan lguohan closed this as completed Jul 29, 2020