Active-Active Dual ToR

Active-active dual ToR link manager is an evolution of the active-standby dual ToR link manager. Both ToRs are expected to handle traffic in normal scenarios. For consistency, we will keep using the term "standby" to refer to inactive links or ToRs.

Revision

| Rev | Date     | Author        | Change Description             |
|-----|----------|---------------|--------------------------------|
| 0.1 | 05/23/22 | Jing Zhang    | Initial version                |
| 0.2 | 12/02/22 | Longxiang Lyu | Add Traffic Forwarding section |
| 0.3 | 12/08/22 | Longxiang Lyu | Add BGP update delay section   |
| 0.4 | 12/13/22 | Longxiang Lyu | Add skip ACL section           |
| 0.5 | 04/10/23 | Longxiang Lyu | Add command line section       |

Scope

This document provides the high-level design of the SONiC dual ToR solution supporting the active-active setup.

Content

1 Cluster Topology

2 Requirement Overview

3 SONiC ToR Controlled Solution

4 Warm Reboot Support

1 Cluster Topology

There are a certain number of racks in a row; each rack has 2 ToRs, and each row has 8 Tier One (T1) network devices. Each server has a NIC connected to the 2 ToRs with 100 Gbps DAC cables.

In this design:

  • Both the upper ToR (labeled as UT0) and the lower ToR (labeled as LT0) will advertise the same IP to upstream T1s; each T1 will see 2 available next hops for the VLAN.
  • Both UT0 and LT0 are expected to carry traffic in normal scenarios.
  • The software stack on the server host will see a 200 Gbps NIC.

(figure: cluster topology)

2 Requirement Overview

2.1 Server Requirements

In our cluster setup, as the smart Y-cable is replaced, some complexity is transferred to the server NIC.

Note that this complexity can be handled by active-active smart cables, or any other deployment, as long as it meets the requirements below.

  1. Server NIC is responsible for delivering southbound (tier 0 device to server) traffic from either uplink to applications running on the server host.

    • ToRs present the same IP and the same MAC to the server on both links.
  2. Server NIC is responsible for dispensing northbound (server to tier 0) traffic between the two active links at the IO stream (5-tuple) level. Each stream will be dispatched to one of the 2 uplinks until the link state changes.

  3. Server should provide support for the ToRs to control traffic forwarding, and follow this control when dispensing traffic.

    • gRPC is introduced for this requirement.
    • Each ToR will have a well-known IP. Server NIC should dispatch gRPC replies towards these IPs to the corresponding uplinks.
  4. Server NIC should avoid sending traffic through unhealthy links when detecting a link state down.

  5. Server should replicate the following northbound traffic to both ToRs:

    • Specified ICMP replies (for probing link health status)
    • ARP propagation
    • IPv6 router solicitation, neighbor solicitation and neighbor advertisements

    Check the pseudo code below for details of the IO scheduling contract.

    // gRPC Response
    if (ethertype == IPv4 && DestIP == Loopback3_Port0_IPv4) or (ethertype == IPv6 && DestIP == Loopback3_Port0_IPv6) 
    { 
      if (Port 0.LinkState == Up)
        Send to Port 0
      else
        Drop
    }
    else if (ethertype == IPv4 && DestIP == Loopback3_Port1_IPv4) or (ethertype == IPv6 && DestIP == Loopback3_Port1_IPv6)  
    { 
      if (Port 1.LinkState == Up)
        Send to Port 1
      else
        Drop
    } 
    
    // ARP
    else if (ethertype == ARP)
      Duplicate to both ports
    
    // ICMP Heartbeat Probing 
    else if ((ethertype == IPv4 && DestIP == Loopback2_IPv4 && IPv4.Protocol == ICMP) or (ethertype == IPv6 && DestIP == Loopback2_IPv6 && IPv6.Protocol == ICMPv6))
      Duplicate to all active ports
    
    // IPv6 router solicitation, neighbor solicitation and neighbor advertisements
    else if (ethertype == IPv6 && IPv6.Protocol == ICMPv6 && ICMPv6.Type in [133, 135, 136])
      Duplicate to both ports
    else if (gRPC status == "Port 0 disabled" || Port0.LinkState == Down)
      Send to Port 1
    else if (gRPC status == "Port 1 disabled" || Port1.LinkState == Down)
      Send to Port 0
    
    // Other Traffic
    else
      Send packet on either port
    

2.2 SONiC Requirements

  1. Introduce active-active mode into MUX state machine.
  2. Probe to determine if link is healthy or not.
  3. Signal the NIC when the ToR is switching to active or standby.
  4. Rescue when a peer ToR failure occurs.
  5. Unblock traffic when cable control channel is unreachable.

3 SONiC ToR Controlled Solution

3.1 IP Routing

3.1.1 Normal Scenario

Both T0s are up and functioning, and both server NIC connections are up and functioning.

  • Control Plane
    UT0 and LT0 will advertise the same VLAN (IPv4 and IPv6) to upstream T1s. Each T1 will see 2 available next hops for the VLAN. T1s advertise to T2s as normal.

  • Data Plane

    • Traffic to the server
      • Traffic lands on any of the T1s by ECMP from T2s.
      • T1 forwards traffic to either of the T0s by ECMP.
      • T0 sends the traffic to the server and NIC delivers traffic up the stack.
    • Traffic from the server to outside the cluster
      • NIC determines which link to use and sends all the packets on a flow using the same link.
      • T0 sends the traffic to the T1 by ECMP.
    • Traffic from the server to within the cluster
      • NIC determines which link to use and sends all the packets on a flow using the same link.
      • T0 sends the traffic to destination server if T0 has learnt the MAC address of the destination server.

3.1.2 Server Uplink Issue

Both T0s are up and functioning, and some servers' NICs are connected to only 1 ToR (due to a cable issue, or because the cable is taken out for maintenance).

  • Control Plane
    No change from the normal case.
  • Data Plane
    • Traffic to the server
      • Traffic lands on any of the T1s by ECMP from T2s.
      • T1 forwards traffic to either of the T0s by ECMP.
      • If T0 does not have the downlink to the server, T0 will send the traffic to the peer T0 over IPinIP encap via T1s.
      • T0 sends the traffic to the server and NIC delivers traffic up the stack.
    • Traffic from the server to outside the cluster
      • T0 will signal to NIC which side to use.
      • NIC determines which link to use and sends all the packets on a flow using the same link. If server NIC has only 1 connection up, all traffic will be on this connection.
      • T0 sends the traffic to the T1 by ECMP.
    • Traffic from the server to within the cluster
      • T0 will signal to NIC which side to use.
      • NIC determines which link to use and sends all the packets on a flow using the same link. If the server NIC has only 1 connection up, all traffic will be on this connection.
      • If T0 does not have the downlink to the server, T0 will send the traffic to the peer T0 over IPinIP encap via T1s.
      • T0 sends the traffic to the server.

3.1.3 ToR Failure

Only 1 T0 is up and functioning.

  • Control Plane
    Only 1 T0 will advertise the VLAN (IPv4 and v6) to upstream T1s.
  • Data Plane
    • Traffic to the server
      • Traffic lands on any of the T1s by ECMP from T2s.
      • T1 forwards traffic to either of the T0s by ECMP. If one T0 is down, T1 forwards traffic to the healthy one.
      • T0 sends the traffic to the server.
    • Traffic from the server to outside the cluster
      • T0 will signal to NIC which side to use.
      • T0 sends the traffic to the T1 by ECMP.
    • Traffic from the server to within the cluster
      • T0 will signal to NIC which side to use.
      • T0 sends the traffic to the server.

3.1.4 Comparison to Active-Standby

Highlights of the commonalities and differences with active-standby:

|                               | Active-Standby                                             | Active-Active                                              | Implication                                                |
|-------------------------------|------------------------------------------------------------|------------------------------------------------------------|------------------------------------------------------------|
| Server uplink view            | Single IP, single MAC                                      | Single IP, single MAC                                      |                                                            |
| Standby side receives traffic | Forward it to the active ToR through IPinIP tunnel via T1  | Forward it to the active ToR through IPinIP tunnel via T1  |                                                            |
| T0 to T1 control plane        | Advertise the same set of routes                           | Advertise the same set of routes                           |                                                            |
| T1 to T0 traffic              | ECMP                                                       | ECMP                                                       |                                                            |
| Southbound traffic            | From either side                                           | From either side                                           |                                                            |
| Northbound traffic            | All is duplicated to both ToRs                             | NIC determines which side to forward the traffic           | Orchagent doesn't need to drop packets on the standby side |
| Bandwidth                     | Up to 1 link                                               | Up to 2 links                                              | T1 and above devices see more throughput from the server   |
| Cable control                 | I2C                                                        | gRPC over DAC cables                                       | Control plane and data plane now share the same link       |

3.2 DB Schema Changes

3.2.1 Config DB

  • New field in the MUX_CABLE table to determine the cable type (an example entry follows the schema below)
MUX_CABLE|PORTNAME:
  cable_type: active-standby|active-active
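
For illustration, an entry matching the CLI example in section 3.9.2 could be written to CONFIG_DB as sketched below; the ConfigDBConnector usage is only a sketch for readers, not a step of the design itself.

# Illustrative sketch: writing a MUX_CABLE entry with the new cable_type field.
# The port name and addresses mirror the CLI example in section 3.9.2.
from swsscommon.swsscommon import ConfigDBConnector

config_db = ConfigDBConnector()
config_db.connect()
config_db.set_entry("MUX_CABLE", "Ethernet4", {
    "server_ipv4": "192.168.0.2/32",
    "server_ipv6": "fc02:1000::2/128",
    "soc_ipv4": "192.168.0.3/32",
    "cable_type": "active-active",
})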

3.2.2 App DB

  • New table to invoke transceiver daemon to query server side forwarding state
FORWARDING_STATE_COMMAND|PORTNAME:
  command: probe | set_active_self | set_standby_self | set_standby_peer
FORWARDING_STATE_RESPONSE|PORTNAME:
  response: active | standby | unknown | error
  response_peer: active | standby | unknown | error
  • New table for transceiver daemon to write peer link state to linkmgrd
PORT_TABLE_PEER|PORTNAME
  oper_status: up|down
  • New table to invoke transceiver daemon to set peer's server side forwarding state
HW_FORWARDING_STATE_PEER|PORTNAME
  state: active|standby|unknown 

3.2.3 State DB

  • New table for transceiver daemon to write peer's server side forwarding state to linkmgrd
HW_MUX_CABLE_TABLE_PEER|PORTNAME
  state: active | standby | unknown

3.3 Linkmgrd

Linkmgrd will provide the determination of a ToR / link's readiness for use.

3.3.1 Link Prober

Linkmgrd will keep the link prober design from active-standby mode for monitoring link health status. Link prober will send ICMP packets and listen to ICMP response packets. ICMP packets will contain payload information about the ToR. ICMP replies will be duplicated to both ToRs from the server, hence a ToR can monitor the health status of its peer ToR as well.

Link Prober will report 4 possible states:

  • LinkProberUnknown: Serves as the initial state. This state is also reached when no ICMP reply is received.
  • LinkProberActive: Indicates that LinkMgr receives ICMP replies containing the ID of the current ToR.
  • LinkProberPeerUnknown: Indicates that LinkMgr did not receive ICMP replies containing the ID of the peer ToR. Hence, there is a chance that the peer ToR’s link is currently down.
  • LinkProberPeerActive: Indicates that LinkMgr receives ICMP replies containing the ID of the peer ToR, or in other words, the peer ToR’s links appear to be active.

By default, the heartbeat probing interval is 100 ms. It takes 3 consecutive link prober packet losses to determine that a link is unhealthy. Server issues can also cause link prober packet loss, but the ToR won't distinguish them from link issues.

ICMP Probing Format
The source MAC will be the ToR's SVI MAC address, and the Ethernet destination will be the well-known MAC address. The source IP will be the ToR's Loopback IP, and the destination IP will be the SoC's IP address, which is introduced as a field in the minigraph.
icmp_format

Linkmgrd also adopts TLV (Type-Length-Value) as the encoding scheme in the payload for additional information elements, including cookie, version, ToR GUID, etc.
icmp_payload
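
For illustration only, a minimal sketch of how a heartbeat probe of this shape could be constructed with scapy; the MAC/IP addresses, interface name and TLV type codes below are assumptions chosen for readability, not values defined by this design:

# Minimal sketch of an ICMP heartbeat probe of this shape, using scapy.
# The MAC/IP addresses, interface name and TLV type codes are hypothetical.
import struct
from scapy.all import Ether, IP, ICMP, Raw, sendp

def tlv(t: int, value: bytes) -> bytes:
    # Type (1 byte) + Length (2 bytes) + Value
    return struct.pack("!BH", t, len(value)) + value

TLV_COOKIE, TLV_VERSION, TLV_TOR_GUID = 0x01, 0x02, 0x03   # hypothetical type codes
payload = (
    tlv(TLV_COOKIE, b"\xde\xad\xbe\xef")
    + tlv(TLV_VERSION, b"\x00\x01")
    + tlv(TLV_TOR_GUID, b"0123456789abcdef")
)

probe = (
    Ether(src="aa:bb:cc:dd:ee:01", dst="00:aa:bb:cc:dd:ee")   # ToR SVI MAC -> well-known MAC
    / IP(src="10.1.0.32", dst="192.168.0.3")                  # ToR Loopback -> SoC IP
    / ICMP(type=8, id=0, seq=1)                               # ICMP echo request
    / Raw(load=payload)
)
sendp(probe, iface="Vlan1000", verbose=False)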

3.3.2 Link State

When a link goes down, linkmgrd will receive a notification from SWSS based on the kernel netlink message. This notification will be used to determine if the ToR is healthy.

3.3.3 Forwarding State

Admin Forwarding State
ToRs will signal the NIC whether a link is active / standby; we will call this active / standby state the admin forwarding state. It's up to the NIC to determine which link to use if both are active, but it should never choose to use a standby link. This logic gives the ToR more control over traffic forwarding.

Operational Forwarding State
The server side should maintain an operational forwarding state as well. When a link is down, eventually the admin forwarding state will be updated to standby. But before that, if the server side detects the link down, it should stop sending traffic through this link even if the admin state is active. In this way, we ensure the ToRs have control over traffic forwarding, and also guarantee an immediate reaction when the link state goes down.
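
A minimal sketch of how the server side could combine the two states when dispatching northbound flows (illustrative names, not a NIC implementation):

# Sketch of the server-side dispatch rule combining the ToR-driven admin
# forwarding state with the locally observed link state (illustrative only).
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Uplink:
    admin_state: str   # "active" or "standby", as signalled by the ToR over gRPC
    link_up: bool      # operational state observed locally by the NIC

def usable(u: Uplink) -> bool:
    # A link carries northbound traffic only if the ToR marked it active
    # AND it is operationally up.
    return u.admin_state == "active" and u.link_up

def pick_uplink(flow_hash: int, uplinks: List[Uplink]) -> Optional[int]:
    candidates = [i for i, u in enumerate(uplinks) if usable(u)]
    if not candidates:
        return None                                    # no healthy active uplink
    return candidates[flow_hash % len(candidates)]     # a flow sticks to one uplink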

3.3.4 Active-Active State Machine

Active-active state transition logic is simplified compared to active-standby. In active-standby, linkmgrd makes mux toggle decisions based on the y-cable direction, while in active-active, the two links are more independent. Linkmgrd will only make state transition decisions based on health indicators.

To be more specific, if the link prober indicates active AND the link state appears to be up, linkmgrd should determine the link's forwarding state to be active; otherwise, it should be standby.

active_active_self

Linkmgrd also provides a rescue mechanism for when the peer can't switch to standby for some reason, e.g. link failures. If the link prober doesn't receive the peer's heartbeat response AND the self ToR is in a healthy active state, linkmgrd should determine the peer link to be standby.
active_active_peer

When the control channel is unreachable, the ToR won't block traffic forwarding, but it will periodically check the gRPC server's health and make sure the server side's admin forwarding state aligns with linkmgrd's decision.
grpc_failure

3.3.5 Default route to T1

If the default route to T1 is missing, the dual ToR system can suffer from northbound packet loss, hence linkmgrd also monitors the default route state. If the default route is missing, linkmgrd will stop sending ICMP probing requests and fake an unhealthy status. This functionality can be disabled as well; the details are included in default_route.

To summarize the state transition decisions discussed above, and the corresponding gRPC actions taken to update the server-side admin forwarding state, we have the decision table below (a sketch of this logic follows the table):

| Default Route to T1 | Link State | Link Prober (SELF) | Link Prober (PEER) | Link Manager State | gRPC Action (SELF) | gRPC Action (PEER) |
|---------------------|------------|--------------------|--------------------|--------------------|--------------------|--------------------|
| Available           | Up         | Active             | Active             | Active             | Set to active      | No-op              |
| Available           | Up         | Active             | Unknown            | Active             | No-op              | Set to standby     |
| Available           | Up         | Unknown            | *                  | Standby            | Set to standby     | No-op              |
| Available           | Down       | *                  | *                  | Standby            | Set to standby     | No-op              |
| Missing             | *          | *                  | *                  | Standby            | Set to standby     | No-op              |
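
A minimal Python sketch of the decision table above, reusing the App DB command names from section 3.2.2 (illustrative only, not linkmgrd's actual implementation, which is C++):

# Illustrative sketch of the decision table above; the type and function names
# are assumptions for readability. The returned gRPC actions reuse the
# FORWARDING_STATE_COMMAND values from section 3.2.2.
from dataclasses import dataclass

@dataclass
class HealthInputs:
    default_route_ok: bool      # default route to T1 is present
    link_up: bool               # local link state
    prober_self_active: bool    # ICMP replies carrying the self ToR ID received
    prober_peer_active: bool    # ICMP replies carrying the peer ToR ID received

def decide(i: HealthInputs):
    """Return (self_state, grpc_action_self, grpc_action_peer)."""
    healthy = i.default_route_ok and i.link_up and i.prober_self_active
    if not healthy:
        # Rows 3-5: go standby and tell the server side to stop using this link.
        return "standby", "set_standby_self", "no-op"
    if not i.prober_peer_active:
        # Row 2 (rescue): self is healthy active but peer heartbeats are missing,
        # so request the server side to treat the peer link as standby.
        return "active", "no-op", "set_standby_peer"
    # Row 1: both links look healthy.
    return "active", "set_active_self", "no-op"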

3.3.6 Incremental Features

  • Link Prober Packet Loss Statistics
    The link prober will by default send a heartbeat packet every 100 ms, so the packet loss statistics can be a good measurement of system health. An incremental feature is to collect the packet loss counts, start time and end time. The collected data is stored and updated in the State DB. Users can check and reset it through the CLI.

  • Support for Detachment
    Users can configure linkmgrd into a mode in which it won't switch to active / standby based on health indicators. Users can also configure linkmgrd into a mode in which it won't modify the peer's forwarding state. This support will be useful for maintenance, upgrade and testing scenarios.

3.4 Orchagent

3.4.1 IPinIP tunnel

Orchagent will create the tunnel at initialization and add / remove routes to forward traffic to the peer ToR via this tunnel when linkmgrd switches state to standby / active.

Check below for an example of the config DB entry and tunnel utilization when LT0's link is having an issue.
tunnel

3.4.2 Flow Diagram and Orch Components

Major components of Orchagent for this IPinIP tunnel are MuxCfgOrch, TunnelOrch, and MuxOrch.
tunnel

  1. MuxCfgOrch
    MuxCfgOrch listens to config DB entries to populate the port to server IP mapping to MuxOrch.

  2. TunnelOrch
    TunnelOrch will subscribe to MUX_TUNNEL table and create tunnel, tunnel termination, and decap entry. This tunnel object would be created when initializing. This tunnel object would be used as nexthop object by MuxOrch for programming route via SAI_NEXT_HOP_TYPE_TUNNEL_ENCAP.

  3. MuxOrch
    MuxOrch will listen to state changes from linkmgrd and do the following at a high level (sketched after this list):

    • Enable / disable neighbor entry.
    • Add / remove tunnel routes.
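
A simplified, illustrative sketch of this reaction (the real MuxOrch is C++ inside orchagent; the helper names below are placeholders):

# Illustrative sketch of MuxOrch's reaction to linkmgrd state changes; the
# helpers below are placeholders that only print the intended operation.
def disable_neighbor(port, ip):
    print(f"{port}: disable neighbor entry for {ip}")

def enable_neighbor(port, ip):
    print(f"{port}: enable neighbor entry for {ip}")

def add_tunnel_route(ip, nexthop):
    print(f"route {ip} -> IPinIP tunnel via {nexthop}")

def remove_tunnel_route(ip):
    print(f"remove tunnel route for {ip}")

def on_mux_state_change(port, new_state, server_ips, peer_tor_loopback):
    if new_state == "standby":
        for ip in server_ips:
            disable_neighbor(port, ip)                # stop forwarding directly to the server
            add_tunnel_route(ip, peer_tor_loopback)   # redirect traffic to the peer ToR
    elif new_state == "active":
        for ip in server_ips:
            remove_tunnel_route(ip)
            enable_neighbor(port, ip)                 # resume direct forwarding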

3.5 Transceiver Daemon

3.5.1 Cable Control through gRPC

In the active-active design, we will use gRPC to do cable control and signal the NIC whether a ToR is active. The SoC will run a gRPC server. Linkmgrd will determine the server-side forwarding state based on the link prober status and link state. Then linkmgrd can invoke the transceiver daemon to update the NIC, through gRPC calls, on whether the ToRs are active or not.

Currently defined gRPC services between the SoC and ToRs related to linkmgrd cable control (a client-side sketch follows the list):

  • DualToRActive
    1. Query forwarding state of ports for both peer and self ToR;
    2. Query server side link state of ports for both peer and self ToR;
    3. Set forwarding states of ports for both peer and self ToR;
  • GracefulRestart
    1. Shutdown / restart notification from SoC to ToR.
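
For illustration, a client-side sketch of how the transceiver daemon could set the server-side admin forwarding state; the generated modules, stub, message, field and RPC names below are hypothetical placeholders, not the actual proto contract:

# Hypothetical client-side sketch; all generated-module, stub, message, field
# and RPC names are placeholders, not the actual proto definitions.
import grpc
import dualtor_active_pb2          # placeholder: generated from a DualToRActive proto
import dualtor_active_pb2_grpc     # placeholder: generated gRPC stubs

def set_admin_forwarding_state(soc_ip, port_ids, states):
    channel = grpc.insecure_channel(f"{soc_ip}:50075")            # port number is an assumption
    stub = dualtor_active_pb2_grpc.DualToRActiveStub(channel)     # placeholder stub name
    request = dualtor_active_pb2.AdminForwardingStateRequest(     # placeholder message
        portid=port_ids,
        state=states,
    )
    return stub.SetAdminForwardingPortState(request)              # placeholder RPC name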

3.6 State Transition Flow

The following UML sequence illustrates the state transition when linkmgrd state moves to active. The flow will be similar for moving to standby.

state transition flow

3.7 Traffic Forwarding

The following shows the traffic forwarding behaviors:

  • both ToRs are active.
  • one ToR is active while the other ToR is standby.

Traffic Forwarding

3.7.1 Special Cases of Traffic Forwarding

3.7.1.1 gRPC Traffic to the NIC IP

There is a scenario where, if the upper ToR enters standby when its peer (the lower ToR) is already in standby state, all downstream I/O from the upper ToR will be forwarded through the tunnel to the peer ToR (the lower ToR), and so will the control plane gRPC traffic from the transceiver daemon. As the lower ToR is in standby, those tunneled I/O will be blackholed, and the NIC will never know that the upper ToR has entered standby in this case.

To solve this issue, we want the control plane gRPC traffic from the transceiver daemon to be forwarded directly via the local devices. This differentiates the control plane traffic to the NIC IPs from data plane traffic, whose forwarding behavior honors the mux state and is forwarded to the active peer ToR via the tunnel when the port goes to standby.

The following shows the traffic forwarding behavior when the lower ToR is active while the upper ToR is standby. Now, gRPC traffic from the standby ToR (the upper ToR) is forwarded to the NIC directly, while the downstream data plane traffic to the upper ToR is directed into the tunnel to the active lower ToR.



When orchagent is notified to change to standby, it will re-program both the ASIC and the kernel so that both control plane and data plane traffic are forwarded via the tunnel. To achieve the design proposed above, MuxOrch will now be changed to skip notifying Tunnelmgrd if the neighbor address is the NIC IP address, so Tunnelmgrd will not re-program the kernel route in this case and the gRPC traffic from the transceiver daemon to the NIC IP address will be forwarded directly.
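
A minimal sketch of the skip condition described above (names are assumptions, not the actual orchagent code):

# Minimal sketch of the skip condition: kernel routes are moved into the
# tunnel only for server IPs, never for the NIC/SoC IP used by the gRPC
# control channel. Names here are assumptions.
def notify_tunnelmgrd_on_standby(neighbor_ip, soc_ips, publish_kernel_tunnel_route):
    if neighbor_ip in soc_ips:
        # NIC/SoC IP: keep the kernel route local so gRPC control traffic
        # from the transceiver daemon keeps being forwarded directly.
        return
    publish_kernel_tunnel_route(neighbor_ip)   # data plane neighbor: kernel route via the tunnel

# usage sketch (addresses mirror the CLI example in 3.9.2)
notify_tunnelmgrd_on_standby("192.168.0.3", {"192.168.0.3"}, print)   # SoC IP: skipped
notify_tunnelmgrd_on_standby("192.168.0.2", {"192.168.0.3"}, print)   # server IP: re-programmed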

The following UML diagram shows this change when Linkmgrd state moves to standby:

message change flow

3.8 Enhancements

3.8.1 Advertise updated routes to T1

The current failover strategy can smoothly handle link failure cases, but if one of the ToRs crashes and a T1 still sends traffic to the crashed ToR, we will see packet loss.

A further improvement in the rescue scenario is, upon detecting the peer's unhealthy status, to have the local ToR advertise more specific routes (i.e. longer prefixes), so that traffic from T1s doesn't go to the crashed ToR at all.

3.8.2 Server Servicing & ToR Upgrade

For server graceful restart, we already have the gRPC service defined in 3.5.1. An indicator of ongoing server servicing should be defined based on that notification, so the ToR can avoid upgrades in the meantime. Vice versa, we can also define gRPC APIs to notify the server when a ToR upgrade is ongoing.

3.8.3 BGP update delay

When the BGP neighbors are started on an active-active T0 switch, the T0 will try to establish BGP sessions with its connected T1 switches. After the BGP sessions are established, the T0 will exchange routes with those T1s. T1 switches usually have more routes than the T0, so they take more time to process routes before sending updates. The consequence is that, after BGP session establishment, T1 switches could receive BGP updates from the T0 before the T0 receives any BGP updates from the T1s. There will be a period in which those T1s have routes learnt from the T0 but the T0 has no routes learnt from the T1s (the T0 has no default routes). In this period, those T1s could send downstream traffic to this T0; as stated in 3.3.5, the T0 is still in standby state, so it will try to forward the traffic via the tunnel. As the T0 has no default route in this period, that traffic will be blackholed.

So for the active-active T0s, a BGP update delay of 10 seconds is introduced into the BGP configuration to postpone sending BGP updates after BGP session establishment. In this case, the T0 can learn routes from the T1s before the T1s learn any routes from the T0, so by the time a T1 can send downstream traffic to the T0, the T0 will have default routes ready.

3.8.4 Skip adding ingress drop ACL

Previously, at a high level, when a mux port went to standby, MuxOrch added an ingress ACL to drop packets on the mux port; when the mux port went back to active, MuxOrch removed the ingress ACL. As described in [3.6], MuxOrch acts as an intermediate agent between linkmgrd and the transceiver daemon. Before the NIC receives the gRPC request to toggle to standby, the ingress drop ACL has already been programmed by MuxOrch. In this period, the server NIC still regards this ToR as active and could send upstream traffic to this ToR, but that upstream traffic will be dropped by the installed ingress drop ACL rule.

A change to skip the installation of the ingress drop ACL rule when toggling to standby is introduced to forward the upstream traffic with best effort. Though the mux port is already in standby state in this period, skipping the ingress drop ACL allows the upstream traffic to reach the ToR and possibly be forwarded by the ToR.

3.9 Command Line

This part only covers the command lines and options for active-active dual ToR.

3.9.1 Show mux status

show mux status returns the mux status for mux ports:

  • PORT: mux port name
  • STATUS: current mux status, could be either active or standby
  • SERVER_STATUS: the mux status read from mux server as the result of last toggle
    • active: mux server returned active as the result of last toggle to active
    • standby: mux server returned standby as the result of last toggle to standby
    • unknown: last toggle failed to switch the mux server status, or failed to read the status from the mux server
    • error: last toggle failed to switch the orchagent status
  • HEALTH: mux port health
    • healthy: the ToR can receive link probe replies from the mux server; the following conditions must be satisfied for a mux port to be healthy:
      • port status is up
      • could receive replies for self link probes
      • current mux status (STATUS) matches server status (SERVER_STATUS), or server status is unknown
      • default route to T1s is present
    • unhealthy: any of the above healthy conditions is broken
  • HWSTATUS: check if current mux status matches server status
    • consistent: STATUS matches SERVER_STATUS
    • inconsistent: STATUS doesn't match SERVER_STATUS
    • absent: SERVER_STATUS is not present
  • LAST_SWITCHOVER_TIME: last switchover timestamp
$ show mux status
PORT        STATUS    SERVER_STATUS    HEALTH    HWSTATUS    LAST_SWITCHOVER_TIME
----------  --------  ---------------  --------  ----------  ---------------------------
Ethernet4   active    active           healthy   consistent  2023-Mar-27 07:57:43.314674
Ethernet8   active    active           healthy   consistent  2023-Mar-27 07:59:33.227819

3.9.2 Show mux config

show mux config returns the mux configurations:

  • SWITCH_NAME: peer switch hostname
  • PEER_TOR: peer switch loopback address
  • PORT: mux port name
  • state: mux mode configuration
    • auto: enable failover logics for both self and peer
    • manual: disable failover logics for both self and peer
    • active: if current mux status is not active, toggle the mux to active once, then work in manual mode
    • standby: if current mux status is not standby, toggle the mux standby once, then work in manual mode
    • detach: enable failover logics only for self
  • ipv4: mux server ipv4 address
  • ipv6: mux server ipv6 address
  • cable_type: mux cable type, active-active for active-active dualtor
  • soc_ipv4: soc ipv4 address
$ show mux config
SWITCH_NAME        PEER_TOR
-----------------  ----------
lab-switch-2       10.1.0.33
port        state    ipv4             ipv6               cable_type     soc_ipv4
----------  -------  ---------------  -----------------  -------------  ---------------
Ethernet4   auto     192.168.0.2/32   fc02:1000::2/128   active-active  192.168.0.3/32
Ethernet8   auto     192.168.0.4/32   fc02:1000::4/128   active-active  192.168.0.5/32

3.9.3 Show mux tunnel-route

show mux tunnel-route returns tunnel routes that have been created for mux ports.

For each mux port, there can be 3 entries: server_ipv4, server_ipv6, and soc_ipv4. For each entry, if the tunnel route is created in the kernel or the ASIC, you will see added in the command output; if not, you will see -. If no tunnel route is created for any of the 3 entries, the mux port won't show up in the command output.

  • Usage:
show mux tunnel-route [OPTIONS] <port_name>

show muxcable tunnel-route <port_name>
  • Options:
--json          display the output in json format
  • Example
$ show mux tunnel-route Ethernet44
PORT        DEST_TYPE    DEST_ADDRESS       kernel    asic
----------  -----------  -----------------  --------  ------
Ethernet44  server_ipv4  192.168.0.22/32    added     added
Ethernet44  server_ipv6  fc02:1000::16/128  added     added
Ethernet44  soc_ipv4     192.168.0.23/32    -         added

3.9.4 Config mux mode

config mux mode configures the operational mux mode for specified port.

# config mux mode <operation_status> <port_name>

argument "<operation_status>" is  choose from:
        active,
        auto,
        manual,
        standby,
        detach.

4 Warm Reboot Support

TBD