[teamd] retry creating team_port after interface info changed #2699

Merged: 1 commit, Mar 28, 2019

Contributor

@yxieca yxieca commented Mar 25, 2019

- What I did

A race condition has been observed after warm reboot: sometimes, when
the port_changed notification is received, the link message does not
contain the device name. Without the device name, creating the team
port fails.

The fix registers for the interface information change notification,
so that when the device name becomes available later, teamd retries
creating the team port.

Signed-off-by: Ying Xie <ying.xie@microsoft.com>
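
The retry described above can be pictured with a small Python model. This is only an illustrative sketch, not the teamd C code from this PR: `IfInfo`, `PortTracker`, and both handler names are hypothetical, and the real change hooks into libteam's change notifications rather than plain method calls.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class IfInfo:
    """Hypothetical stand-in for the per-interface info in a link message."""
    ifindex: int
    ifname: str = ""  # empty when the link message has no device name yet


@dataclass
class PortTracker:
    """Toy model of 'retry creating the port once the device name shows up'."""
    ports: Dict[int, str] = field(default_factory=dict)       # ifindex -> created port
    pending: Dict[int, IfInfo] = field(default_factory=dict)  # creations to retry

    def _try_create(self, info: IfInfo) -> bool:
        if not info.ifname:
            # Creating the port without a name would fail, so remember it instead.
            self.pending[info.ifindex] = info
            return False
        self.ports[info.ifindex] = info.ifname
        self.pending.pop(info.ifindex, None)
        return True

    def on_port_changed(self, info: IfInfo) -> bool:
        """First attempt, driven by the port-change notification."""
        return self._try_create(info)

    def on_ifinfo_changed(self, info: IfInfo) -> bool:
        """Retry, driven by the interface-info change notification."""
        if info.ifindex not in self.pending:
            return False  # nothing to retry for this interface
        return self._try_create(info)


if __name__ == "__main__":
    tracker = PortTracker()
    print(tracker.on_port_changed(IfInfo(104)))                  # name missing
    print(tracker.on_ifinfo_changed(IfInfo(104, "Ethernet72")))  # name arrived
    print(tracker.ports)
```

The point of the model is that the second notification is the only chance to recover: if nothing listens for interface-info changes, the deferred creation is never retried.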

- How to verify it
I ran continuous warm reboots on my DUT and verified the retry with debug messages.

Without the change, continuous warm reboot would fail within 20 iterations. With the fix, the iteration count has gone up to 78 and the test is still running.

for f in *.log; do cnt=$(grep -c iteration "$f"); echo "$f $cnt"; done
wb-test-20190313-0437.log 13
wb-test-20190319-2212.log 2
wb-test-20190319-2307.log 1
wb-test-20190319-2313.log 3
wb-test-20190319-2348.log 6
wb-test-20190320-0119.log 20
wb-test-20190321-0020.log 2
wb-test-20190321-0207.log 10
wb-test-20190321-1733.log 7
wb-test-20190321-1839.log 5
wb-test-20190321-2131.log 7
wb-test-20190322-0133.log 4
wb-test-20190322-0219.log 5
wb-test-20190322-0429.log 3
wb-test-20190322-1656.log 19
wb-test-20190322-2158.log 2
wb-test-20190323-0006.log 7
wb-test-20190323-0650.log 4
wb-test-20190323-0729.log 3
wb-test-20190323-1855.log 2
wb-test-20190323-1911.log 1
wb-test-20190323-1924.log 10
wb-test-20190324-1911.log 78
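
For anyone who prefers to tally the runs outside the shell, the same count can be sketched in Python. This assumes, as the one-liner above does, that each completed warm-reboot iteration leaves exactly one log line containing the word "iteration"; the function names are made up for this example.

```python
import glob
from typing import Dict, Iterable


def count_iterations(lines: Iterable[str], marker: str = "iteration") -> int:
    """Count the lines that mention the marker (same idea as grep | wc -l)."""
    return sum(1 for line in lines if marker in line)


def summarize(pattern: str = "*.log") -> Dict[str, int]:
    """Map each matching log file to its iteration count."""
    counts = {}
    for path in sorted(glob.glob(pattern)):
        with open(path, errors="replace") as fh:
            counts[path] = count_iterations(fh)
    return counts


if __name__ == "__main__":
    for path, cnt in summarize().items():
        print(path, cnt)
```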

@jipanyang
Collaborator

Could you share the error log collected for the problem? I don't quite understand how "port_changed notification was received, the link message didn't have the device name" happened.

Also, I'd like to know whether it is related to the incorrect startup order (teammgrd, then teamsyncd).

@yxieca
Contributor Author

yxieca commented Mar 25, 2019

The logs that come with libteamd don't have all the information. I had to add quite a few extra log entries to eventually piece the puzzle together (keep in mind that anything starting with '===' was added or enhanced):

  1. It all started when the LAG was created, initially with no members:
    Mar 25 03:44:07.155421 str-7260cx3-acs-1 DEBUG teamd#teamd_PortChannel0004[47]: <ifinfo_list>
    Mar 25 03:44:07.155421 str-7260cx3-acs-1 DEBUG teamd#teamd_PortChannel0004[47]: 9: PortChannel0004: 74:83:ef:0a:d0:79: 0
    Mar 25 03:44:07.155421 str-7260cx3-acs-1 DEBUG teamd#teamd_PortChannel0004[47]: </ifinfo_list>
    Mar 25 03:44:07.155421 str-7260cx3-acs-1 DEBUG teamd#teamd_PortChannel0004[47]: <port_list>
    Mar 25 03:44:07.155421 str-7260cx3-acs-1 DEBUG teamd#teamd_PortChannel0004[47]: *104: : up 0Mbit HD
    Mar 25 03:44:07.155421 str-7260cx3-acs-1 DEBUG teamd#teamd_PortChannel0004[47]: </port_list>

  2. At some point, the port change notification arrived. Note that the interface index (104) is known, but there is no device name yet:
    Mar 25 03:44:07.155421 str-7260cx3-acs-1 ERR teamd#teamd_PortChannel0004[47]: === allocating port_obj 0x240b6e0 for team_port 0x2406cf0 name ifindex 104 ifinfo 0x24206e0

  3. That causes the LAG creation to fail (the first message was added and the second was enhanced; you should be able to see "Failed to init port priv" in the log):
    Mar 25 03:44:07.164984 str-7260cx3-acs-1 ERR teamd#teamd_PortChannel0004[47]: === ioctl SIOCADDMULTI failed: dev '' (0x2420750). lp 0x2419740 tp 0x240b6e0
    Mar 25 03:44:07.189333 str-7260cx3-acs-1 ERR teamd#teamd_PortChannel0004[47]: === Failed to init port priv. po 0x240b6e0, ppi 0x2419720, ppi->priv 0x2419740, ppt->c_priv 0x2419738

  4. Shortly afterwards there is another event, this time with the device name Ethernet72. Without registering for interface changes, teamd does not get this event and the LAG is never created, since there are no further port changes for Ethernet72:
    Mar 25 03:44:07.192357 str-7260cx3-acs-1 ERR teamd#teamd_PortChannel0004[47]: === allocating port_obj 0x240b6e0 for team_port 0x2406cf0 name Ethernet72 ifindex 104 ifinfo 0x24206e0

  5. Now there is one port in the LAG. The other port came up with a proper device name, and the LAG is then ready (of course, the race condition could hit both ports at the same time):
    Mar 25 03:44:07.223354 str-7260cx3-acs-1 DEBUG teamd#teamd_PortChannel0004[47]: <ifinfo_list>
    Mar 25 03:44:07.223354 str-7260cx3-acs-1 DEBUG teamd#teamd_PortChannel0004[47]: 104: Ethernet72: 74:83:ef:0a:d0:79: 9
    Mar 25 03:44:07.223354 str-7260cx3-acs-1 DEBUG teamd#teamd_PortChannel0004[47]: 9: PortChannel0004: 74:83:ef:0a:d0:79: 0
    Mar 25 03:44:07.223354 str-7260cx3-acs-1 DEBUG teamd#teamd_PortChannel0004[47]: </ifinfo_list>
    Mar 25 03:44:07.223354 str-7260cx3-acs-1 DEBUG teamd#teamd_PortChannel0004[47]: <port_list>
    Mar 25 03:44:07.223354 str-7260cx3-acs-1 DEBUG teamd#teamd_PortChannel0004[47]: *107: Ethernet76: up 0Mbit HD
    Mar 25 03:44:07.223354 str-7260cx3-acs-1 DEBUG teamd#teamd_PortChannel0004[47]: 104: Ethernet72: up 0Mbit HD
    Mar 25 03:44:07.223354 str-7260cx3-acs-1 DEBUG teamd#teamd_PortChannel0004[47]: </port_list>
    Mar 25 03:44:07.223354 str-7260cx3-acs-1 ERR teamd#teamd_PortChannel0004[47]: === allocating port_obj 0x2422e80 for team_port 0x240bbd0 name Ethernet76 ifindex 107 ifinfo 0x240ec70
    Mar 25 03:44:07.223354 str-7260cx3-acs-1 DEBUG teamd#teamd_PortChannel0004[47]: Ethernet76: Using implicit link watch.
    Mar 25 03:44:07.223354 str-7260cx3-acs-1 DEBUG teamd#teamd_PortChannel0004[47]: Ethernet76: Got link watch from port config.
    Mar 25 03:44:07.223354 str-7260cx3-acs-1 DEBUG teamd#teamd_PortChannel0004[47]: Ethernet76: Using lacp_prio "255".
    Mar 25 03:44:07.223354 str-7260cx3-acs-1 DEBUG teamd#teamd_PortChannel0004[47]: Ethernet76: Using lacp_key "0".
    Mar 25 03:44:07.223354 str-7260cx3-acs-1 DEBUG teamd#teamd_PortChannel0004[47]: Ethernet76: Using sticky "0".
    Mar 25 03:44:07.223354 str-7260cx3-acs-1 DEBUG teamd#teamd_PortChannel0004[47]: Added loop callback: lacp_socket, 0x2406e10
    Mar 25 03:44:07.223354 str-7260cx3-acs-1 DEBUG teamd#teamd_PortChannel0004[47]: Added loop callback: lacp_periodic, 0x2406e10
    Mar 25 03:44:07.223354 str-7260cx3-acs-1 DEBUG teamd#teamd_PortChannel0004[47]: Ethernet76: Setting periodic timer to "slow".
    Mar 25 03:44:07.223354 str-7260cx3-acs-1 DEBUG teamd#teamd_PortChannel0004[47]: Added loop callback: lacp_timeout, 0x2406e10
    Mar 25 03:44:07.223882 str-7260cx3-acs-1 DEBUG teamd#teamd_PortChannel0004[47]: <ifinfo_list>
    Mar 25 03:44:07.223882 str-7260cx3-acs-1 DEBUG teamd#teamd_PortChannel0004[47]: 107: Ethernet76: 74:83:ef:0a:d0:79: 9
    Mar 25 03:44:07.223882 str-7260cx3-acs-1 DEBUG teamd#teamd_PortChannel0004[47]: 104: Ethernet72: 74:83:ef:0a:d0:79: 9
    Mar 25 03:44:07.223882 str-7260cx3-acs-1 DEBUG teamd#teamd_PortChannel0004[47]: 9: PortChannel0004: 74:83:ef:0a:d0:79: 0
    Mar 25 03:44:07.223882 str-7260cx3-acs-1 DEBUG teamd#teamd_PortChannel0004[47]: </ifinfo_list>

The race is mostly between libnl and libteamd, but teamd has to be able to handle it.
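
To spot the same symptom in a larger log, the "=== allocating port_obj" lines can be scanned for a missing name. The sketch below is in Python and assumes the exact message layout shown above; those messages come from the extra debug logging described in this comment, not from stock teamd, and the function names are made up for illustration.

```python
import re
from typing import Dict, Iterable, List, Optional, Tuple

# Matches the debug message format quoted above; the name field may be empty.
ALLOC_RE = re.compile(
    r"allocating port_obj \S+ for team_port \S+ "
    r"name\s+(?:(?P<name>\S+)\s+)?ifindex\s+(?P<ifindex>\d+)"
)


def parse_alloc(line: str) -> Optional[Tuple[int, str]]:
    """Return (ifindex, device name) for an allocation line, else None."""
    m = ALLOC_RE.search(line)
    if m is None:
        return None
    return int(m.group("ifindex")), m.group("name") or ""


def nameless_ifindexes(lines: Iterable[str]) -> List[int]:
    """ifindexes whose most recent allocation message had no device name."""
    latest: Dict[int, str] = {}
    for line in lines:
        parsed = parse_alloc(line)
        if parsed is not None:
            latest[parsed[0]] = parsed[1]
    return sorted(idx for idx, name in latest.items() if not name)


if __name__ == "__main__":
    sample = [
        "=== allocating port_obj 0x240b6e0 for team_port 0x2406cf0 name ifindex 104 ifinfo 0x24206e0",
        "=== allocating port_obj 0x240b6e0 for team_port 0x2406cf0 name Ethernet72 ifindex 104 ifinfo 0x24206e0",
    ]
    print([parse_alloc(line) for line in sample])
    print(nameless_ifindexes(sample))
```

An ifindex that stays in the result of `nameless_ifindexes` for a whole log is one where the later event carrying the name never arrived.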

@jipanyang
Collaborator

Currently TeamMgr::doLagMemberTask() makes sure the member port state is ok before adding the port to the LAG. The port state is set to ok by portsyncd upon receiving the RTM_NEWLINK message, which contains the correct device name.

Why does the device/link name become unavailable when the processing reaches teamd later? Is it a bug in teamd?

@yxieca
Contributor Author

yxieca commented Mar 25, 2019

I had the same question, but with the information from your question, I think I have an answer now:

  1. It is a known issue that the initial RTM_NEWLINK can arrive without a device name.
  2. teammgrd deals with it by waiting until the device name is available.
  3. A similar thing is now done in teamd.

Why do we need to do it again? Because after warm reboot, the LAGs are restored by teamd, and this operation bypasses the teammgrd protection.

So I guess we really need to fix it in libnl. I looked at libnl while investigating this issue, but I didn't find a good way of patching it. Maybe someone more familiar with libnl could take another look?

Cheers,
Ying

@yxieca
Contributor Author

yxieca commented Mar 25, 2019

The overnight test reached 172 iterations; all 4 peers have seen their LAGs up for ~21 hours.

yinxi@acs-trusty8:$ ssh vm-1 "show interface po1"
Password:
Port-Channel1 is up, line protocol is up (connected)
Hardware is Port-Channel, address is 5254.003f.38de
Internet address is 10.0.0.33/31
Broadcast address is 255.255.255.255
Address determined by manual configuration
IPv6 link-local address is fe80::5054:ff:fe3f:38de/64
IPv6 global unicast address(es):
fc00::22, subnet is fc00::20/126
IP MTU 1500 bytes
Full-duplex, Unconfigured
Active members in this channel: 2
... Ethernet1 , Full-duplex, Unconfigured
... Ethernet2 , Full-duplex, Unconfigured
Fallback mode is: off
Up 20 hours, 51 minutes, 7 seconds
127 link status changes since last clear
Last clearing of "show interface" counters never
5 minutes input rate 8.75 kbps (- with framing overhead), 3 packets/sec
5 minutes output rate 20.7 kbps (- with framing overhead), 4 packets/sec
30029550 packets input, 4372447368 bytes
Received 4477 broadcasts, 589101 multicast
0 input errors, 0 input discards
11082681 packets output, 3004702653 bytes
Sent 514 broadcasts, 525686 multicast
0 output errors, 0 output discards
yinxi@acs-trusty8:$ ssh vm-2 "show interface po1"
Password:
Port-Channel1 is up, line protocol is up (connected)
Hardware is Port-Channel, address is 5254.0046.c8e3
Internet address is 10.0.0.35/31
Broadcast address is 255.255.255.255
Address determined by manual configuration
IPv6 link-local address is fe80::5054:ff:fe46:c8e3/64
IPv6 global unicast address(es):
fc00::26, subnet is fc00::24/126
IP MTU 1500 bytes
Full-duplex, Unconfigured
Active members in this channel: 2
... Ethernet1 , Full-duplex, Unconfigured
... Ethernet2 , Full-duplex, Unconfigured
Fallback mode is: off
Up 20 hours, 51 minutes, 15 seconds
668 link status changes since last clear
Last clearing of "show interface" counters never
5 minutes input rate 9.49 kbps (- with framing overhead), 3 packets/sec
5 minutes output rate 20.2 kbps (- with framing overhead), 4 packets/sec
45463370 packets input, 7356854288 bytes
Received 6384 broadcasts, 920840 multicast
0 input errors, 0 input discards
17286065 packets output, 5228048304 bytes
Sent 2327 broadcasts, 839577 multicast
0 output errors, 0 output discards
yinxi@acs-trusty8:$ ssh vm-3 "show interface po1"
Password:
Port-Channel1 is up, line protocol is up (connected)
Hardware is Port-Channel, address is 5254.0053.0dbf
Internet address is 10.0.0.37/31
Broadcast address is 255.255.255.255
Address determined by manual configuration
IPv6 link-local address is fe80::5054:ff:fe53:dbf/64
IPv6 global unicast address(es):
fc00::2a, subnet is fc00::28/126
IP MTU 1500 bytes
Full-duplex, Unconfigured
Active members in this channel: 2
... Ethernet1 , Full-duplex, Unconfigured
... Ethernet2 , Full-duplex, Unconfigured
Fallback mode is: off
Up 20 hours, 51 minutes, 37 seconds
756 link status changes since last clear
Last clearing of "show interface" counters never
5 minutes input rate 23.8 kbps (- with framing overhead), 10 packets/sec
5 minutes output rate 19.6 kbps (- with framing overhead), 3 packets/sec
30547304 packets input, 5701239147 bytes
Received 6296 broadcasts, 908131 multicast
0 input errors, 0 input discards
14157728 packets output, 4789280043 bytes
Sent 3299 broadcasts, 822843 multicast
0 output errors, 0 output discards
yinxi@acs-trusty8:$ ssh vm-4 "show interface po1"
Password:
Port-Channel1 is up, line protocol is up (connected)
Hardware is Port-Channel, address is 5254.00bb.81fd
Internet address is 10.0.0.39/31
Broadcast address is 255.255.255.255
Address determined by manual configuration
IPv6 link-local address is fe80::5054:ff:febb:81fd/64
IPv6 global unicast address(es):
fc00::2e, subnet is fc00::2c/126
IP MTU 1500 bytes
Full-duplex, Unconfigured
Active members in this channel: 2
... Ethernet1 , Full-duplex, Unconfigured
... Ethernet2 , Full-duplex, Unconfigured
Fallback mode is: off
Up 20 hours, 51 minutes, 39 seconds
768 link status changes since last clear
Last clearing of "show interface" counters never
5 minutes input rate 21.1 kbps (- with framing overhead), 3 packets/sec
5 minutes output rate 18.8 kbps (- with framing overhead), 3 packets/sec
27506835 packets input, 5774821693 bytes
Received 6208 broadcasts, 741207 multicast
0 input errors, 0 input discards
10433764 packets output, 4298133173 bytes
Sent 3309 broadcasts, 640459 multicast
0 output errors, 0 output discards

@yxieca yxieca requested a review from jipanyang March 25, 2019 15:14
@jipanyang
Collaborator

"It is a known issue that initial RTM_NEWLINK could come up without device name. teammgrd dealt with it by adding a wait until device name is available."

Could you help point me to the issue/code change? TeamMgr::addLagMember() doesn't seem to have exactly that processing.

For system warm reboot, the kernel is clean. Although the LACP state is restored from the saved LACP PDU file, the LAG and LAG members are managed by teammgrd. How does this operation bypass the teammgrd protection?

Your change looks like it fixes the problem. I just want to get a clear understanding of the problem scenario.

@yxieca
Contributor Author

yxieca commented Mar 25, 2019

Things are getting more interesting. libnl has an 'optimization': it won't report an interface information change if the information has already changed once. It looks like we might be hitting that too.

At the 182nd iteration, the test hit an issue where teamd repeatedly retried creating a port_obj with an empty name. Now we need an even longer test cycle to prove the fix is safe.

@yxieca
Contributor Author

yxieca commented Mar 25, 2019

Jipan,

I saw some changes that make teamd do things differently under warm reboot. That might be mainly to make sure that we use the same LAG member ID, to prevent the peer from tearing down the LAGs.

What you claimed might be right. I didn't fully track down who creates the new LAGs. One thing I know is that the LAGs are created very early after warm reboot; they are usually already there by the time I can ssh in.

Cheers,
Ying

@yxieca yxieca merged commit 80d6594 into sonic-net:master Mar 28, 2019
@yxieca yxieca deleted the teamd branch March 28, 2019 16:57
yxieca added a commit that referenced this pull request Mar 28, 2019
tiantianlv pushed a commit to SONIC-DEV/sonic-buildimage that referenced this pull request Apr 10, 2019
…net#2699)

dprital added a commit to dprital/sonic-buildimage that referenced this pull request May 1, 2023
Update sonic-utilities submodule pointer to include the following:
* 88ffb167 [config]config reload should generate sysinfo if missing ([sonic-net#2778](sonic-net/sonic-utilities#2778))
* 7443b9e5 [sonic-package-manager] support extension with multiple YANG modules ([sonic-net#2752](sonic-net/sonic-utilities#2752))
* 522c3a9e [sonic-package-manager] add support for multiple CLI plugin files ([sonic-net#2753](sonic-net/sonic-utilities#2753))
* b38fcfd1 [show][muxcable] fix  RC ([sonic-net#2812](sonic-net/sonic-utilities#2812))
* 7e24463f [chassis]: remote cli commands infra for sonic chassis ([sonic-net#2701](sonic-net/sonic-utilities#2701))
* bee593e4 [DPB]Fixing typo in config breakout output ([sonic-net#2802](sonic-net/sonic-utilities#2802))
* ada603c5 [config]Support multi-asic  Golden Config override ([sonic-net#2738](sonic-net/sonic-utilities#2738))
* 88a7daa8 [show][barefoot] replace shell=True ([sonic-net#2699](sonic-net/sonic-utilities#2699))
* 5e99edb5 [sonic_package_manager] replace shell=True ([sonic-net#2726](sonic-net/sonic-utilities#2726))
* b547bb45 [acl-loader] Only add default deny rule when table is L3 or L3V6 ([sonic-net#2796](sonic-net/sonic-utilities#2796))

Signed-off-by: dprital <drorp@nvidia.com>
StormLiangMS added a commit that referenced this pull request May 31, 2023
Why I did it
69abbc3c - (HEAD, origin/master, origin/HEAD) Revert "[GCU] Complete RDMA Platform Validation Checks [device][platform] Update Inventec new platform d6356 #2791" DellEMC S6100 Watchdog Support #2854 (8 minutes ago)
4fead896 - [sonic-package-manager] fix CLI plugin compatibility issue [sonic-utilities] advance submodule head to latest #2842 (27 hours ago)
db61efca - [vlan][dhcp_relay] Clear dhcpv6 relay counter while deleting vlan ([201811] [services] Restart SwSS service upon unexpected critical process exit #2852) (33 hours ago)
d5544b4a - [config] Generate sysinfo as needed when override config ([minigraph]: Add mirror type v6 condition #2836) (6 days ago)
f258e2a3 - [GCU] Complete RDMA Platform Validation Checks ([device][platform] Update Inventec new platform d6356 #2791) (6 days ago)
b4f4e63e - Revert "Revert frr route check ([mlnx] fix url inconsistency in fw.mk #2761)" (Support TACACS Accounting #2762) (7 days ago)
3d89589f - Update pcieutil error message on loading common pcie module (Enable Debugs in BCM Kernel-bde and Knet Modules #2786) (11 days ago)
e6aacd37 - Update TRANSCEIVER_INFO table after CDB FW upgrade (Remove unused packages in docker images and host (#2807) #2837) (2 weeks ago)
33d665c4 - replace shell=True, replace xml, and replace exit() ([mellanox-simx] add ability to build simx-compatiable image #2664) (2 weeks ago)
9e510a83 - [chassis][voq[Add "config fabric port ..." commands and tests. (Watchdog enable/disable in DellEMC S6100  #2730) (2 weeks ago)
aeb0dbc1 - Fix the invalid variable issue when set-fips in uboot (fix bug in file sonic-cfggen #2834) (3 weeks ago)
1e73632d - [test]: add UT coverage for GCU (Feed device info to orchagent process #2818) (3 weeks ago)
3a9995b6 - [config]Support multi-asic Golden Config override with fix ([mellanox] Update Mellanox MFT packedge #2825) (3 weeks ago)
3fb32588 - Revert "[chassis]: remote cli commands infra for sonic chassis ([mellanox] add makefiles to build Mellanox SDK from sources  #2701)" ([dhcp_relay] Base DHCP Relay Docker container on Debian Stretch #2832) (3 weeks ago)
2ffe6e37 - [show][mlnx] replace shell=True, replace xml (Add support of HwSKU Mellanox-SN2700-C28D8 #2700) (3 weeks ago)
a5091bba - [sonic_sku_create] remove shell=True, replace exit() with sys.exit() (removed exec from script which that prevents the further lines to be … #2816) (3 weeks ago)
71ef4f16 - [build] Fix base OS compilation issue caused by incompatibility with requests >= 2.29.0. ([201811][sairedis][utilities] advance sub module heads #2830) (3 weeks ago)
1097373b - [show] Added alias interface mode support for 'show interfaces counters ...' command ([kernel]: update sonic kernel to 4.9.0-8-2 #2468) (4 weeks ago) <Julian Chang - TW>
589375fc - correctly parsing complete ipv6 vnet info ([201811][mellanox] Update Mellanox FW version to 13.1910.0928 #2827) (4 weeks ago)
634ac77c - LAG keepalive script to reduce lacp session wait during warm-reboot (Set proper hostname on containers startup #2806) (4 weeks ago)
331c9de0 - [config]: Dynamically start and stop ndppd ([Arista] Add QoS needed files for Arista 7170 #2814) (4 weeks ago)
d1f307d0 - [GCU]Fix rdma check failure ([device/celestica]: Add fwutil #2824) (4 weeks ago)
ce81a340 - Revert "[config]Support multi-asic Golden Config override (Before issue “sonic-clear counters”, “show interface counters” result not complete #2738)" ([BGP docker]: start bgp_eoiu_mark service to populate bgp eoiu marker… #2823) (4 weeks ago)
61e0e810 - Added platform plugin support in load_minigraph ([db migrator] migrate the DB to latest schema when needed #2808) (4 weeks ago)
d4355a96 - Change default CDB run mode to non-hitless (Revert "Watchdog enable/disable in DellEMC S6100 " #2817) (4 weeks ago)
88ffb167 - [config]config reload should generate sysinfo if missing ([Mellanox] Update SAI #2778) (4 weeks ago)
7443b9e5 - [sonic-package-manager] support extension with multiple YANG modules (dhcp_relay service stopped with "systemctl stop swss" but not restarted with "systemctl restart swss" #2752) (4 weeks ago)
522c3a9e - [sonic-package-manager] add support for multiple CLI plugin files (Updated Makefile infrastructure to build debug images. #2753) (4 weeks ago)
b38fcfd1 - [show][muxcable] fix show mux hwmode muxdirection RC (syncd-rpc.mk: Fix stretch dockers build failure #2812) (5 weeks ago)
7e24463f - [chassis]: remote cli commands infra for sonic chassis ([mellanox] add makefiles to build Mellanox SDK from sources  #2701) (6 weeks ago)
bee593e4 - [DPB]Fixing typo in config breakout output ([submodule update]: Quagga bgpd crash fix #2802) (6 weeks ago)
ada603c5 - [config]Support multi-asic Golden Config override (Before issue “sonic-clear counters”, “show interface counters” result not complete #2738) (6 weeks ago)
88a7daa8 - [show][barefoot] replace shell=True ([teamd] retry creating team_port after interface info changed #2699) (6 weeks ago)
5e99edb5 - [sonic_package_manager] replace shell=True (Upgrade Mellanox HW-MGMT: fix high CPU utilization issue #2726) (6 weeks ago)
b547bb45 - [acl-loader] Only add default deny rule when table is L3 or L3V6 ([201811] [radvd] Build radvd from source; Patch so as not to treat out-of-range MTU as an error #2796) (6 weeks ago)
mihirpat1 pushed a commit to mihirpat1/sonic-buildimage that referenced this pull request Jun 14, 2023
…#2699)

**What I did**

Enforce the order when the shared headroom pool is enabled.

**Why I did it**

The current flow to enable the shared headroom pool
1. Configure the shared headroom pool size or over-subscribe ratio
2. Update lossless buffer profiles with `xon == size`
3. Calculate and update the shared headroom pool size.

In step 2, the lossless buffer profiles have been updated to values as if the shared headroom pool is enabled. However, it is enabled only in step 3, which is inconsistent between steps 2 and 3. Therefore, we open the PR to guarantee the order.

The new flow
1. A user configures the shared headroom pool size or over-subscribe ratio
2. The dynamic buffer manager invokes the vendor-specific Lua plugin to calculate the shared headroom pool size
    - This is the step introduced in this PR to guarantee the shared headroom pool will be enabled in advance
    - On Nvidia platform, a non-zero shared headroom pool is returned in this stage if the user configures the over-subscribe ratio
3. If a non-zero shared headroom pool is returned, the dynamic buffer manager pushes the shared headroom pool size to APPL_DB.ingress_lossless_pool and blocks until it has been updated into APPL_STATE_DB.ingress_lossless_pool (which indicates the buffer orchagent finishes handling it)
4. The buffer manager updates the lossless buffer profiles
5. The buffer manager invokes the Lua plugin to calculate the shared headroom pool size.
6. The flow continues as normal.

**How I verified it**

Manually test and regression test
sonic-otn pushed a commit to sonic-otn/sonic-buildimage that referenced this pull request Sep 20, 2023

3 participants