Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[fast-reboot] APPLY_VIEW failure in handleSaiCreateStatus causing orchagent crash #7798

Closed
vaibhavhd opened this issue Jun 4, 2021 · 7 comments · Fixed by sonic-net/sonic-swss#1815

Comments

@vaibhavhd
Copy link
Contributor

Description

After fast-reboot, during APPLY_VIEW stage, Orchagent crashed and swss kept restarting. From syslog, the error reported is as:
ERR swss#orchagent: :- handleSaiCreateStatus: Encountered failure in create operation, exiting orchagent, SAI API: SAI_API_FDB, status: SAI_STATUS_ITEM_ALREADY_EXISTS

Steps to reproduce the issue:

  1. Execute platform_tests/test_fast_reboot
  2. Test will fail with "FAILED:dut:DUT hasn't started to work for 300 seconds",
  3. Check syslog for more details.

Describe the results you received:

syslog

May 26 18:52:46.998538 str-dx010-acs-4 NOTICE admin: Rebooting with /sbin/kexec -e to SONiC-OS-HEAD.793-953f9134 ...
...
May 26 18:53:10.560740 str-dx010-acs-4 INFO systemd[1]: Starting switch state service...
May 26 18:53:10.571306 str-dx010-acs-4 NOTICE root: Starting swss service...
May 26 18:53:10.579680 str-dx010-acs-4 NOTICE root: Locking /tmp/swss-syncd-lock from swss service
May 26 18:53:10.592509 str-dx010-acs-4 NOTICE root: Locked /tmp/swss-syncd-lock (10) from swss service
May 26 18:53:15.552312 str-dx010-acs-4 NOTICE root: Started swss service...
May 26 18:53:15.559564 str-dx010-acs-4 NOTICE root: Unlocking /tmp/swss-syncd-lock (10) from swss service
May 26 18:53:15.570426 str-dx010-acs-4 INFO systemd[1]: Started switch state service.
...
May 26 18:53:38.087204 str-dx010-acs-4 NOTICE swss#orchagent: :- sai_redis_notify_syncd: switched ASIC to APPLY VIEW

...
May 26 18:53:44.917809 str-dx010-acs-4 NOTICE swss#orchagent: :- handlePortQosMapTable: Applied QoS maps to ports
May 26 18:53:45.018182 str-dx010-acs-4 INFO syncd#syncd: [none] SAI_API_FDB:_brcm_sai_fdb_event_cb:132 fdbEvent: 1 for mac 72-06-00-01-00-07 vid:0x3e8, port:0x3a lagid:0x0 flags:0x10440 flags2:0x0 lag:false station flags 0x0
May 26 18:53:45.018568 str-dx010-acs-4 INFO syncd#syncd: [none] SAI_API_NEIGHBOR:_brcm_sai_nbr_mac_db_update:4977 brcm_sai_mptnl_notify_nhop_event success: 0
May 26 18:53:45.095652 str-dx010-acs-4 INFO syncd#syncd: [none] SAI_API_FDB:_brcm_sai_fdb_event_cb:132 fdbEvent: 1 for mac 72-06-00-01-00-04 vid:0x3e8, port:0x2e lagid:0x0 flags:0x10440 flags2:0x0 lag:false station flags 0x0
May 26 18:53:45.095968 str-dx010-acs-4 INFO syncd#syncd: [none] SAI_API_NEIGHBOR:_brcm_sai_nbr_mac_db_update:4977 brcm_sai_mptnl_notify_nhop_event success: 0
May 26 18:53:45.112870 str-dx010-acs-4 INFO syncd#syncd: message repeated 2 times: [ [none] SAI_API_NEIGHBOR:_brcm_sai_nbr_mac_db_update:4977 brcm_sai_mptnl_notify_nhop_event success: 0]
May 26 18:53:45.112870 str-dx010-acs-4 INFO syncd#syncd: [none] SAI_API_FDB:brcm_sai_create_fdb_entry:707 FDB Create: MAC:72-06-00-01-00-00 port_tid:0x50 port_type:Port vid:0x3e8
May 26 18:53:45.114834 str-dx010-acs-4 INFO syncd#syncd: [none] SAI_API_NEIGHBOR:_brcm_sai_nbr_mac_db_update:4977 brcm_sai_mptnl_notify_nhop_event success: 0
May 26 18:53:45.115026 str-dx010-acs-4 INFO syncd#syncd: [none] SAI_API_NEIGHBOR:_brcm_sai_nbr_mac_db_update:4977 brcm_sai_mptnl_notify_nhop_event success: 0
May 26 18:53:45.115026 str-dx010-acs-4 INFO syncd#syncd: [none] SAI_API_FDB:brcm_sai_create_fdb_entry:707 FDB Create: MAC:72-06-00-01-00-01 port_tid:0x22 port_type:Port vid:0x3e8
May 26 18:53:45.116777 str-dx010-acs-4 INFO syncd#syncd: [none] SAI_API_NEIGHBOR:_brcm_sai_nbr_mac_db_update:4977 brcm_sai_mptnl_notify_nhop_event success: 0
May 26 18:53:45.116990 str-dx010-acs-4 INFO syncd#syncd: [none] SAI_API_NEIGHBOR:_brcm_sai_nbr_mac_db_update:4977 brcm_sai_mptnl_notify_nhop_event success: 0
May 26 18:53:45.116990 str-dx010-acs-4 INFO syncd#syncd: [none] SAI_API_FDB:brcm_sai_create_fdb_entry:707 FDB Create: MAC:72-06-00-01-00-02 port_tid:0x26 port_type:Port vid:0x3e8
May 26 18:53:45.118777 str-dx010-acs-4 INFO syncd#syncd: [none] SAI_API_NEIGHBOR:_brcm_sai_nbr_mac_db_update:4977 brcm_sai_mptnl_notify_nhop_event success: 0
May 26 18:53:45.118982 str-dx010-acs-4 INFO syncd#syncd: [none] SAI_API_NEIGHBOR:_brcm_sai_nbr_mac_db_update:4977 brcm_sai_mptnl_notify_nhop_event success: 0
May 26 18:53:45.118982 str-dx010-acs-4 INFO syncd#syncd: [none] SAI_API_FDB:brcm_sai_create_fdb_entry:707 FDB Create: MAC:72-06-00-01-00-03 port_tid:0x2a port_type:Port vid:0x3e8
May 26 18:53:45.119882 str-dx010-acs-4 NOTICE swss#orchagent: message repeated 27 times: [ :- handlePortQosMapTable: Applied QoS maps to ports]
May 26 18:53:45.119882 str-dx010-acs-4 ERR swss#orchagent: :- meta_sai_validate_fdb_entry: object key SAI_OBJECT_TYPE_FDB_ENTRY:{"bvid":"oid:0x260000000005e4","mac":"72:06:00:01:00:04","switch_id":"oid:0x21000000000000"} already exists
May 26 18:53:45.119882 str-dx010-acs-4 ERR swss#orchagent: :- addFdbEntry: Failed to create dynamic FDB 72:06:00:01:00:04 in Vlan1000 on Ethernet28, rv:-6
...
May 26 18:53:45.119882 str-dx010-acs-4 ERR swss#orchagent: :- meta_sai_validate_fdb_entry: object key SAI_OBJECT_TYPE_FDB_ENTRY:{"bvid":"oid:0x260000000005e4","mac":"72:06:00:01:00:04","switch_id":"oid:0x21000000000000"} already exists
May 26 18:53:45.119882 str-dx010-acs-4 ERR swss#orchagent: :- addFdbEntry: Failed to create dynamic FDB 72:06:00:01:00:04 in Vlan1000 on Ethernet28, rv:-6
May 26 18:53:45.119882 str-dx010-acs-4 ERR swss#orchagent: :- handleSaiCreateStatus: Encountered failure in create operation, exiting orchagent, SAI API: SAI_API_FDB, status: SAI_STATUS_ITEM_ALREADY_EXISTS
May 26 18:53:45.120719 str-dx010-acs-4 NOTICE swss#orchagent: :- uninitialize: begin
May 26 18:53:45.120719 str-dx010-acs-4 NOTICE swss#orchagent: :- uninitialize: begin
May 26 18:53:45.120719 str-dx010-acs-4 NOTICE swss#orchagent: :- ~RedisChannel: join ntf thread begin
May 26 18:53:45.120719 str-dx010-acs-4 NOTICE swss#orchagent: :- ~RedisChannel: join ntf thread end
May 26 18:53:45.121136 str-dx010-acs-4 NOTICE swss#orchagent: :- clear_local_state: clearing local state
May 26 18:53:45.121136 str-dx010-acs-4 NOTICE swss#orchagent: :- meta_init_db: begin
May 26 18:53:45.124774 str-dx010-acs-4 NOTICE swss#orchagent: :- meta_init_db: end
May 26 18:53:45.125207 str-dx010-acs-4 NOTICE swss#orchagent: :- uninitialize: end
May 26 18:53:45.125572 str-dx010-acs-4 NOTICE swss#orchagent: :- stopRecording: stopped recording
May 26 18:53:45.125572 str-dx010-acs-4 NOTICE swss#orchagent: :- stopRecording: closed recording file: sairedis.rec
May 26 18:53:45.125572 str-dx010-acs-4 NOTICE swss#orchagent: :- uninitialize: end
May 26 18:53:45.312173 str-dx010-acs-4 INFO swss#supervisord 2021-05-26 18:53:45,310 INFO exited: orchagent (exit status 1; not expected)
May 26 18:53:46.325191 str-dx010-acs-4 INFO swss#/supervisor-proc-exit-listener: Process 'orchagent' exited unexpectedly. Terminating supervisor 'swss'

sairedis logs

2021-05-26.18:56:20.383025|g|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000|SAI_SWITCH_ATTR_MAX_ACL_ACTION_COUNT=2
2021-05-26.18:56:20.384234|G|SAI_STATUS_ATTR_NOT_SUPPORTED_0|
2021-05-26.18:56:20.410895|a|APPLY_VIEW
2021-05-26.18:56:20.416148|A|SAI_STATUS_FAILURE

Describe the results you expected:

Output of show version:

            "SONiC Software Version: SONiC.HEAD.793-953f9134", 
            "Distribution: Debian 10.9", 
            "Kernel: 4.19.0-12-2-amd64", 
            "Build commit: 953f9134", 
            "Build date: Tue May 25 21:14:45 UTC 2021", 
            "Built by: johnar@jenkins-worker-22",

Output of show techsupport:

(paste your output here or download and attach the file here )

Additional information you deem important (e.g. issue happens only occasionally):

test_advanced_reboot (1).log

@zhangyanzhao
Copy link
Collaborator

@prsunny please take a look

@kcudnik
Copy link
Contributor

kcudnik commented Jun 23, 2021

please attach full syslog and sairedis.rec from that encounter,
probably what's happens: OA tries to reconfigure asic, and during configuration some traffic tart flowing and switch learns some new fdb entries (which are not pushed yet by OA) and OA gets notification of that, and metadata is learning that just before explicitly creating those fdb entries, i need logs to configm that

also notice that error code returned is SAI_STATUS_ITEM_ALREADY_EXISTS, so OA could take notice of that for FDB entry, and just skip that during creation since entry exists, maybe some "SET" attributes could be explicitly set again

@vaibhavhd
Copy link
Contributor Author

Thanks Kamil! The older logs are attached.

But, here I am attaching the latest logs. Also, this issue is inconsistent, and thus I have also seen successful fast-reboot test results.

syslog-fdb-error.txt
sairedis.rec.5.gz

@kcudnik
Copy link
Contributor

kcudnik commented Jun 23, 2021

from first logs:

May 26 18:53:45.095652 str-dx010-acs-4 INFO syncd#syncd: [none] SAI_API_FDB:_brcm_sai_fdb_event_cb:132 fdbEvent: 1 for mac 72-06-00-01-00-04 vid:0x3e8, port:0x2e lagid:0x0 flags:0x10440 flags2:0x0 lag:false station flags 0x0
May 26 18:53:45.119882 str-dx010-acs-4 ERR swss#orchagent: :- meta_sai_validate_fdb_entry: object key SAI_OBJECT_TYPE_FDB_ENTRY:{"bvid":"oid:0x260000000005e4","mac":"72:06:00:01:00:04","switch_id":"oid:0x21000000000000"} already exists

first line is where fdb is learned by hw, second where OA tries to explicitly create fdb
PS. log also contains notifications for port state change that arrive quickly since snoop is done, seems like they arrive right after switch is created before port_list query

@kcudnik
Copy link
Contributor

kcudnik commented Jun 23, 2021

from new attached logs, same scenario:

Jun 23 17:51:53.245722 str2-7050cx3-acs-01 INFO syncd#syncd: [none] SAI_API_FDB:_brcm_sai_fdb_event_cb:132 fdbEvent: 1 for mac 72-06-00-01-00-09 vid:0x73e8, port:0xb000000b lagid:0x0 flags:0x10440 flags2:0x0 lag:false station flags 0x0
Jun 23 17:51:53.245780 str2-7050cx3-acs-01 INFO syncd#syncd: [none] SAI_API_FDB:_brcm_sai_fdb_event_cb:278 fdbEvent: FDB  event: 1 for mac 72-06-00-01-00-09 port_tid:0x35 port_type:1 vid:0x3e8
Jun 23 17:51:54.795520 str2-7050cx3-acs-01 ERR swss#orchagent: :- meta_sai_validate_fdb_entry: object key SAI_OBJECT_TYPE_FDB_ENTRY:{"bvid":"oid:0x2600000000063e","mac":"72:06:00:01:00:09","switch_id":"oid:0x21000000000000"} already exists
Jun 23 17:51:54.795567 str2-7050cx3-acs-01 ERR swss#orchagent: :- addFdbEntry: Failed to create dynamic FDB 72:06:00:01:00:09 in Vlan1000 on Ethernet48, rv:-6
Jun 23 17:51:54.795567 str2-7050cx3-acs-01 ERR swss#orchagent: :- handleSaiCreateStatus: Encountered failure in create operation, exiting orchagent, SAI API: SAI_API_FDB, status: SAI_STATUS_ITEM_ALREADY_EXISTS

72:06:00:01:00:09 mac is learned and about 1.5 seconds later OA tries to create it, and fails since exist already, 1.5 sec seems a long time here compared to previous 0.02sec

2021-06-23.17:51:36.815176|c|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000|SAI_SWITCH_ATTR_INIT_SWITCH=true|SAI_SWITCH_ATTR_FDB_EVENT_NOTIFY=0x55cb3c348db0|SAI_SWITCH_ATTR_PORT_STATE_CHANGE_NOTIFY=0x55cb3c348dc0|SAI_SWITCH_ATTR_SWITCH_SHUTDOWN_REQUEST_NOTIFY=0x55cb3c348dd0|SAI_SWITCH_ATTR_SRC_MAC_ADDRESS=94:8E:D3:05:34:E4
2021-06-23.17:51:46.669662|g|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000|SAI_SWITCH_ATTR_DEFAULT_VIRTUAL_ROUTER_ID=oid:0x0
2021-06-23.17:51:52.306854|n|port_state_change|[{"port_id":"oid:0x1000000000003","port_state":"SAI_PORT_OPER_STATUS_UP"}]|
2021-06-23.17:51:52.350489|c|SAI_OBJECT_TYPE_BRIDGE_PORT:oid:0x3a000000000643|SAI_BRIDGE_PORT_ATTR_TYPE=SAI_BRIDGE_PORT_TYPE_PORT|SAI_BRIDGE_PORT_ATTR_PORT_ID=oid:0x1000000000006|SAI_BRIDGE_PORT_ATTR_ADMIN_STATE=true|SAI_BRIDGE_PORT_ATTR_FDB_LEARNING_MODE=SAI_BRIDGE_PORT_FDB_LEARNING_MODE_HW
2021-06-23.17:51:53.096935|n|fdb_event|[{"fdb_entry":"{\"bvid\":\"oid:0x2600000000063e\",\"mac\":\"72:06:00:01:00:13\",\"switch_id\":\"oid:0x21000000000000\"}","fdb_event":"SAI_FDB_EVENT_LEARNED","list":[{"id":"SAI_FDB_ENTRY_ATTR_BRIDGE_PORT_ID","value":"oid:0x3a00000000065b"}]}]|
2021-06-23.17:51:53.246953|n|fdb_event|[{"fdb_entry":"{\"bvid\":\"oid:0x2600000000063e\",\"mac\":\"72:06:00:01:00:09\",\"switch_id\":\"oid:0x21000000000000\"}","fdb_event":"SAI_FDB_EVENT_LEARNED","list":[{"id":"SAI_FDB_ENTRY_ATTR_BRIDGE_PORT_ID","value":"oid:0x3a000000000653"}]}]|
2021-06-23.17:51:54.265363|n|fdb_event|[{"fdb_entry":"{\"bvid\":\"oid:0x2600000000063e\",\"mac\":\"00:06:07:08:09:0A\",\"switch_id\":\"oid:0x21000000000000\"}","fdb_event":"SAI_FDB_EVENT_LEARNED","list":[{"id":"SAI_FDB_ENTRY_ATTR_BRIDGE_PORT_ID","value":"oid:0x3a000000000649"}]}]|

Jun 23 17:51:54.795520 str2-7050cx3-acs-01 ERR swss#orchagent: :- meta_sai_validate_fdb_entry: object key SAI_OBJECT_TYPE_FDB_ENTRY:{"bvid":"oid:0x2600000000063e","mac":"72:06:00:01:00:09","switch_id":"oid:0x21000000000000"} already exists

i selected couple events here from sairedis.rec, switch is created ad 36 seconds, and first operations start sa 46, at 52 ports are getting up and also at 52 bridge ports are created for multiple ports with SAI_BRIDGE_PORT_FDB_LEARNING_MODE_HW, so those bridge ports will be learning new fdbs as they appear on ports, since currently no fdb is present at the switch, and we can see that first fdb is learned at 53, then at 54, and also at 54 OA tries to create fdb with mac 72:06:00:01:00:09 and fails since that already exists and was learned 1.5 seconds before

there are 2 ways to mitigate this:

  • create bridge ports with SAI_BRIDGE_PORT_FDB_LEARNING_MODE_NONE and after explicit create fdb entries, bring back learning mode to SAI_BRIDGE_PORT_FDB_LEARNING_MODE_HW
  • ignore ALREADY_EXISTS error, and treat this like entry was explicitly created

because OA fails and exit of this error, we don't have full configuration recorded here, but from create switch at 17:51:36 to error at 17:51:54 elapses 18 seconds, which is way under 30 seconds expected for fast-boot scenario - to actually see how long it takes we would need to get log from successful run
what's is interesting that we are starting to see this issue recently, it would be nice to have a history of this failures so we could corelate this to some PRs,

PS. from sairedis.rec, right after switch_created we receive many port notification state change with SAI_PORT_OPER_STATUS_DOWN (not sure if we had this before), create switch here takes about 10 seconds, so maybe those notifications have time to build up, but this is actually minor issue

@yxieca
Copy link
Contributor

yxieca commented Jun 28, 2021

I think @kcudnik 's either suggestion is good. But I want to point out:

  1. The meta function is not changed for long time. Why did we not see this issue earlier?
  2. This is a sonic bug. the second creating call didn't reach SAI, it failed in meta functions.

@kcudnik
Copy link
Contributor

kcudnik commented Jun 29, 2021

I think @kcudnik 's either suggestion is good. But I want to point out:

  1. The meta function is not changed for long time. Why did we not see this issue earlier?
  2. This is a sonic bug. the second creating call didn't reach SAI, it failed in meta functions.

Ad1. thats a good question, maybe we have been lucky ? if i remember correctly @vaibhavhd was saying that this error is present for some time, and i pointed out in previous post that would be nice to have history of this test run so we could corelate this to PR that might cause this issue to start

Ad2. what do you mean second creating call ? each fdb entry is created only once explicitly by OA, and it fails, since metadata reported that entry exists, and this is correct behavior. PS. there is one scenario here that we can hit, since notifications are asynchronous, this can happen:

  • OA creates fdb entry, and passes metadata,
  • OA receives notification that exact same entry was already created (entry is already in HW)
  • syncd starts to process fdb_entry create and fails with error alrady_exists, since hw learned that entry

in async mode this scenario will crash syncd, in sync mode you will get already_exists error

shi-su added a commit to sonic-net/sonic-swss that referenced this issue Jul 8, 2021
What I did
Ignore ALREADY_EXIST error in FDB creation.
Fix: sonic-net/sonic-buildimage#7798

Why I did it
In FDB creation, there are scenarios where the hardware learns an FDB entry before orchagent. In such cases, the FDB SAI creation would report the status of SAI_STATUS_ITEM_ALREADY_EXISTS, and orchagent should ignore the error and treat it as entry was explicitly created.
shi-su added a commit to shi-su/sonic-swss that referenced this issue Aug 17, 2021
What I did
Ignore ALREADY_EXIST error in FDB creation.
Fix: sonic-net/sonic-buildimage#7798

Why I did it
In FDB creation, there are scenarios where the hardware learns an FDB entry before orchagent. In such cases, the FDB SAI creation would report the status of SAI_STATUS_ITEM_ALREADY_EXISTS, and orchagent should ignore the error and treat it as entry was explicitly created.
shi-su added a commit to shi-su/sonic-swss that referenced this issue Aug 17, 2021
What I did
Ignore ALREADY_EXIST error in FDB creation.
Fix: sonic-net/sonic-buildimage#7798

Why I did it
In FDB creation, there are scenarios where the hardware learns an FDB entry before orchagent. In such cases, the FDB SAI creation would report the status of SAI_STATUS_ITEM_ALREADY_EXISTS, and orchagent should ignore the error and treat it as entry was explicitly created.
judyjoseph pushed a commit to sonic-net/sonic-swss that referenced this issue Aug 26, 2021
What I did
Ignore ALREADY_EXIST error in FDB creation.
Fix: sonic-net/sonic-buildimage#7798

Why I did it
In FDB creation, there are scenarios where the hardware learns an FDB entry before orchagent. In such cases, the FDB SAI creation would report the status of SAI_STATUS_ITEM_ALREADY_EXISTS, and orchagent should ignore the error and treat it as entry was explicitly created.
raphaelt-nvidia pushed a commit to raphaelt-nvidia/sonic-swss that referenced this issue Oct 5, 2021
What I did
Ignore ALREADY_EXIST error in FDB creation.
Fix: sonic-net/sonic-buildimage#7798

Why I did it
In FDB creation, there are scenarios where the hardware learns an FDB entry before orchagent. In such cases, the FDB SAI creation would report the status of SAI_STATUS_ITEM_ALREADY_EXISTS, and orchagent should ignore the error and treat it as entry was explicitly created.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants