
New link struct - Vxlan #1532

Merged: 44 commits merged into srl-labs:main on Oct 6, 2023
Conversation

@steiler (Collaborator) commented Aug 18, 2023

Implement VxLAN as a struct that implements the links.Link interface.
This is then meant to be utilized via the tools command, but it will also allow VxLAN links to be defined between containerlab nodes via the topology file.
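For illustration, the topology-file side of this could map onto a raw link struct roughly like the one below; only Remote and ParentInterface appear verbatim in the PR's code, the remaining field and tag names are assumptions for this sketch:

// LinkVxlanRaw sketches how a vxlan link definition from the topology file
// could be unmarshalled. Field names other than Remote/ParentInterface are
// illustrative only.
type LinkVxlanRaw struct {
	Type            string      `yaml:"type"`
	Endpoint        EndpointRaw `yaml:"endpoint"`
	Remote          string      `yaml:"remote"`
	VNI             int         `yaml:"vni"`
	UDPPort         int         `yaml:"udp-port"`
	ParentInterface string      `yaml:"parent-interface,omitempty"`
}

// EndpointRaw sketches the endpoint part of the link definition.
type EndpointRaw struct {
	Node  string `yaml:"node"`
	Iface string `yaml:"interface"`
	MAC   string `yaml:"mac,omitempty"`
}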

@steiler steiler marked this pull request as ready for review August 30, 2023 13:47
@steiler (Collaborator, Author) commented Aug 30, 2023

@hellt this one is ready to be reviewed.
It already contains tests and docs as well.

@steiler steiler requested a review from hellt August 30, 2023 13:59
@steiler (Collaborator, Author) commented Sep 6, 2023

The tools command still needs to be migrated.

@hellt (Member) commented Sep 28, 2023

note to self:

  • check that we use port 14789 by default instead of 4789, as the latter is often blocked (a minimal sketch of this follows below)
  • check that we expose an option to change the port (both in the link definition and the tools cmd)
  • make sure to create an altname for vxlan links whose name is longer than 15 chars
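For illustration, a minimal sketch of what creating such a vxlan link with the non-default port could look like via the vishvananda/netlink library; the function name and its parameters are assumptions for this sketch, not code from the PR:

package main

import (
	"fmt"
	"log"
	"net"

	"github.com/vishvananda/netlink"
)

// defaultVxlanPort is the non-IANA destination port discussed above,
// used because the standard port 4789 is often blocked.
const defaultVxlanPort = 14789

// createVxlan creates and brings up a vxlan link (illustrative helper).
func createVxlan(name string, vni int, remote net.IP, parentIndex, port int) error {
	if port == 0 {
		port = defaultVxlanPort
	}
	vx := &netlink.Vxlan{
		LinkAttrs:    netlink.LinkAttrs{Name: name},
		VxlanId:      vni,
		Group:        remote,      // remote VTEP address
		VtepDevIndex: parentIndex, // parent (outgoing) interface index
		Port:         port,        // UDP destination port
	}
	if err := netlink.LinkAdd(vx); err != nil {
		return fmt.Errorf("failed to create vxlan link %q: %w", name, err)
	}
	return netlink.LinkSetUp(vx)
}

func main() {
	// hypothetical values: vni 100, remote VTEP 192.168.66.1, parent ifindex 2
	if err := createVxlan("vx-clab-100", 100, net.ParseIP("192.168.66.1"), 2, 0); err != nil {
		log.Fatal(err)
	}
}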

nodes/srl/srl.go Outdated
}
iface.Mtu = mtu
}
iface.Mtu = e.GetLink().GetMtu()
@hellt (Member) commented:

This is what causes config to break now in the srl tests in CI.

Two issues here:

  1. when we use the srl node kind, the errors from the srl postdeploy are not visible (need to understand why)
  2. when changed to the nokia_srlinux kind, the following becomes evident:
ERRO[0017] failed to run postdeploy task for node srl1: command execution error:At line 10: Error: Error in path: .interface{.name=="ethernet-1/1"}.mtu
    [InvalidArgument] mtu 9500 out of range [1500-9412]
 
ERRO[0017] failed to run postdeploy task for node srl2: command execution error:At line 13: Error: Error in path: .interface{.name=="ethernet-1/1"}.mtu
    [InvalidArgument] mtu 9500 out of range [1500-9412]

The default 9500 MTU shouldn't be used in the srl config: the 9500 link MTU is a veth MTU only, not something to configure in NOSes.

@hellt (Member) commented:

In b51def3 I changed it so that the MTU value is set in the template endpoints struct only if it is not the default veth MTU value.
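Roughly, the fix amounts to the following check; the constant name and the template type are assumptions for this sketch, not the actual identifiers from that commit:

// DefaultVethLinkMTU stands in for containerlab's default veth MTU (9500).
const DefaultVethLinkMTU = 9500

// srlIface is a stand-in for the interface entry in the SR Linux config template.
type srlIface struct {
	Name string
	Mtu  int
}

// setTemplateMTU copies the link MTU into the config template only when it
// differs from the default veth MTU, so the NOS never sees the 9500B default.
func setTemplateMTU(iface *srlIface, linkMTU int) {
	if linkMTU != DefaultVethLinkMTU {
		iface.Mtu = linkMTU
	}
}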

@hellt (Member) commented Oct 3, 2023:

So here is the issue with SR Linux and low-MTU interfaces.
SR Linux emulates the BCOM datapath, and the port MTU on recent hw platforms can't be less than 1500B.
When using vxlan interfaces we ultimately rely on the MTU of the outgoing (parent) interface, which is often configured with 1500B itself. VXLAN encapsulation adds roughly 50B of overhead, which leaves the vxlan interface with a 1450B MTU; that is too low and can't be configured on the SR Linux platform.

Because of that, using vxlan interfaces inside the SRL netns can only be done if the parent interface of the vxlan tunnel has a 1550B+ MTU.

In practice, with connections to virtual labs, this likely means we will still mostly use the vxlan-stitched mode rather than vxlan-embed.

Comment on lines 1 to 17
name: vxlan

topology:
  nodes:
    s1:
      kind: srl
      image: ghcr.io/nokia/srlinux
      startup-config: 01-vxlan-s1.config
  links:
    - type: vxlan
      endpoint:
        node: s1
        interface: e1-1
        mac: 02:00:00:00:00:04
      remote: 192.168.66.1
      vni: 100
      udp-port: 5555
@hellt (Member) commented Oct 1, 2023:

I think instead of this lab we can use the following:

name: vxlan-embed

topology:
  nodes:
    l1:
      kind: linux
      image: alpine:3
      exec:
        - ip addr add dev eth1 192.168.0.1/30
      mgmt-ipv4: 172.20.20.91
    l2:
      kind: linux
      image: alpine:3
      exec:
        - ip addr add dev eth1 192.168.0.2/30
      mgmt-ipv4: 172.20.20.92

  links:
    - type: vxlan
      endpoint:
        node: l1
        interface: l1eth1
      remote: 172.20.20.92
      vni: 100
    - type: vxlan
      endpoint:
        node: l2
        interface: l2eth1
      remote: 172.20.20.91
      vni: 100

That way you don't have to touch the host at all (or clean it up afterwards, which currently doesn't happen).

Note that this lab can't be deployed today; there is some error about the same file already being present...

@hellt (Member) commented:

Looking at the code, it seems this lab is not possible to deploy. Since we have to find the outgoing interface (I wonder if this is a mandatory step?), we rely on the host's routing stack. That makes sense, since in real-life applications the tunnels are established between hosts, not from within containers as in the lab above.

@hellt (Member) commented:

It is not possible to deploy this one, since vxlan links created in a single netns can't have the same VNI.
I will rework this lab.

@hellt (Member) commented Oct 2, 2023:

It is still worth looking into this closer. It could be an alpine issue or something. Everything below this line is an early observation.

This lab works, but with a prerequisite.
When the vxlan interface is first created in the host netns and then moved to the container netns, it needs to initiate the connection towards the remote first.
If the remote end (l2) starts to ping l1 first, it won't succeed; l1 needs to start pinging first.

name: vxlan-embed

topology:
  nodes:
    l1:
      kind: linux
      image: alpine:3
      exec:
        - ip addr add dev eth1 192.168.0.1/30
      mgmt-ipv4: 172.20.20.91
    l2:
      kind: linux
      image: alpine:3
      exec:
        - >
          ash -c '
          apk add iproute2 &&
          ip link add name vxlan0 type vxlan id 100 remote 172.20.20.91 dstport 14789 &&
          ip l set dev vxlan0 up &&
          ip addr add dev vxlan0 192.168.0.2/30'
      mgmt-ipv4: 172.20.20.92

  links:
    - type: vxlan
      endpoint:
        node: l1
        interface: eth1
      remote: 172.20.20.92
      vni: 100

@michelredondo commented:

There is something strange happening with vxlan tunnels moved to the container netns.

I have built a topology that is deployed on two different hosts like this:

#host1
name: vxlan-embed
prefix: ""

mgmt:
  network: fixedips
  ipv4_subnet: 100.103.1.0/24

topology:
  kinds:
    linux:
      image: ghcr.io/hellt/network-multitool

  nodes:
    node1:
      kind: linux
      mgmt_ipv4: 100.103.1.11
      exec:
        - ip addr add dev eth1 192.168.0.1/30
  
  links:
    - type: vxlan
      endpoint:
        node: node1
        interface: eth1
        mac: 02:00:00:00:00:02
      remote: 100.103.2.22
      vni: 100
      udp-port: 4789
#host2
name: vxlan-embed
prefix: ""

mgmt:
  network: fixedips
  ipv4_subnet: 100.103.2.0/24

topology:
  kinds:
    linux:
      image: ghcr.io/hellt/network-multitool

  nodes:
    node2:
      kind: linux
      mgmt_ipv4: 100.103.2.22
      exec:
        - ip addr add dev eth1 192.168.0.2/30
  
  links:
    - type: vxlan
      endpoint:
        node: node2
        interface: eth1
        mac: 02:00:00:00:00:01
      remote: 100.103.1.11
      vni: 100
      udp-port: 4789

A ping from 192.168.0.1 to 192.168.0.2 and vice versa does not work.

The first thing I see is that the vxlan tunnels are created with the “nolearning” flag. This disables the snooping of incoming packets that builds the MAC-to-VTEP fdb table.

2829: eth1@if2829: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether 02:00:00:00:00:01 brd ff:ff:ff:ff:ff:ff link-netnsid 0 promiscuity 0 minmtu 68 maxmtu 65535
    vxlan id 100 remote 100.103.6.11 dev if30 srcport 0 0 dstport 4789 nolearning ttl auto ageing 300 udpcsum noudp6zerocsumtx noudp6zerocsumrx numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
    inet 192.168.0.1/30 scope global eth1

If we look at fdbs in node1 and node2:

host1:
00:00:00:00:00:00 dev eth1 dst 100.103.2.22 via ifindex 9 link-netnsid 0 self permanent

host2:
00:00:00:00:00:00 dev eth1 dst 100.103.1.11 via ifindex 30 link-netnsid 0 self permanent

BUM traffic will be delivered to those dst IP addresses.
These ifindexes (nexthops) are root-namespace interfaces: they are the interfaces used to reach the tunnel destination IP. That can be the mgmt bridge or another interface in the host.

If we do a tcpdump to check the vxlan traffic, we see that it's not coming from the node's mgmt_ipv4 but from the IP of the bridge in the host. It looks like the UDP socket of the vxlan tunnel is created in the root namespace. I understand that's also the reason you see “udp port 4789 unreachable” in tcpdumps from the container.

If I add fdb entries using IPs in the root namespace:

##host1
ip netns exec node1  bridge fdb append 00:00:00:00:00:00 dev eth1 self dst 100.103.2.1 vni 100 port 4789

##host2
ip netns exec node1 bridge fdb append 00:00:00:00:00:00 dev eth1 self dst 100.103.1.1 vni 100 port 4789

.. traffic starts to flow immediately.

Hope it helps.
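For reference, a rough Go equivalent of such a bridge fdb append call via the vishvananda/netlink library; the function name and the addresses are illustrative only, and the call has to run inside the container's netns, just like the commands above:

package main

import (
	"log"
	"net"

	"github.com/vishvananda/netlink"
	"golang.org/x/sys/unix"
)

// appendDefaultFDBEntry adds an all-zero-MAC fdb entry on the vxlan interface,
// pointing BUM traffic at the given VTEP address.
func appendDefaultFDBEntry(vxlanIfName string, vtep net.IP) error {
	link, err := netlink.LinkByName(vxlanIfName)
	if err != nil {
		return err
	}
	neigh := &netlink.Neigh{
		LinkIndex:    link.Attrs().Index,
		Family:       unix.AF_BRIDGE,
		State:        netlink.NUD_PERMANENT,
		Flags:        netlink.NTF_SELF,
		IP:           vtep,
		HardwareAddr: net.HardwareAddr{0, 0, 0, 0, 0, 0}, // all-zero MAC = BUM traffic
	}
	return netlink.NeighAppend(neigh)
}

func main() {
	// host1 example from above: point eth1's BUM traffic at the root-ns bridge IP
	if err := appendDefaultFDBEntry("eth1", net.ParseIP("100.103.2.1")); err != nil {
		log.Fatal(err)
	}
}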

@hellt (Member) commented Oct 9, 2023:

Hi @michelredondo
We will bring the learning flag back.
As for the interface IP: it is indeed expected that the interface belongs to the root netns, as this is where the tunnel interface was created.

In general, I feel that moving vxlan to container netns is a bit hacky

@michelredondo commented:

Even with the learning flag, topologies like the one I described will not work. Both containers have the “00:00:00:00:00:00” fdb entry pointing to the incorrect IP, so it will never be possible to learn MACs from the other tunnel end.

The https://github.com/srl-labs/containerlab/blob/main/tests/08-vxlan/01-vxlan.clab.yml test case works because the srl1 fdb entry points to the correct IP (172.20.25.22). When you originate traffic from that side, srl1 sends BUM traffic to the l2 node and l2 learns how to send traffic back.
If you try to originate traffic from the l2 node, BUM traffic is sent to the srl1 mgmt IP (172.20.25.21), but the vxlan UDP socket is not listening there; it's listening in the root namespace (bridge IP: 172.20.25.1). Learning in this case never happens.

The only way I can think of to make this work is to add an additional fdb entry pointing to the IP address in the root ns. Maybe we could add this as an optional parameter in the link definition.

links/link_vxlan.go (outdated, resolved)
links/link_vxlan.go (outdated, resolved)
tests/08-vxlan/01-vxlan.robot (outdated, resolved)
links/link.go Outdated
Comment on lines 369 to 384
	// rename the link created with random name if its length is acceptable by linux
	if len(endpt.GetIfaceName()) < 15 {
		err := netlink.LinkSetName(l, endpt.GetIfaceName())
		if err != nil {
			return fmt.Errorf(
				"failed to rename link: %v", err)
		}
	} else {
		// else we set the desired long name as alias
		// in future we need to set it as an alternative name,
		// pending https://twitter.com/ntdvps/status/1709580216648024296
		err := netlink.LinkSetAlias(l, endpt.GetIfaceName())
		if err != nil {
			return fmt.Errorf(
				"failed to add alias: %v", err)
		}
@hellt (Member) commented:

Here is a ~big change I made to the original proposal.
Instead of creating a hashed name for the interface, I propose the following:

  1. if an interface name is 14+ chars, we keep the random name that has already been generated by clab and use it
  2. we set the long name as an alias for the interface. The alias is like a description of the interface and can contain 128 chars. The alias, though, doesn't let you use ip link show <alias>.
  3. I am waiting for Add support for alternative names vishvananda/netlink#862 to merge so that we can use an alternative name instead of the alias. The altname can then be used with the ip utility (an illustrative stopgap is sketched below).
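Until that netlink PR merges, an altname could in principle be attached by shelling out to iproute2 (ip link property add ... altname ...). A sketch of that stopgap, not what this PR does (the PR sets an alias), with hypothetical interface names:

package main

import (
	"fmt"
	"log"
	"os/exec"
)

// setAltName attaches an alternative interface name via iproute2 as a stopgap
// until the netlink library exposes altnames.
func setAltName(dev, altName string) error {
	out, err := exec.Command("ip", "link", "property", "add", "dev", dev, "altname", altName).CombinedOutput()
	if err != nil {
		return fmt.Errorf("failed to add altname %q to %s: %v (%s)", altName, dev, err, out)
	}
	return nil
}

func main() {
	// give the short, randomly named link its long, user-facing name as altname
	if err := setAltName("vx-1a2b3c", "very-long-vxlan-interface-name"); err != nil {
		log.Fatal(err)
	}
}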

Comment on lines -666 to -694
// VethCleanup iterates over links found in clab topology to initiate removal of dangling veths
// in host networking namespace or attached to linux bridge.
// See https://github.com/srl-labs/containerlab/issues/842 for the reference.
func (c *CLab) VethCleanup(ctx context.Context) error {
	hostBasedEndpoints := []links.Endpoint{}

	// collect the endpoints of regular nodes
	for _, n := range c.Nodes {
		if n.Config().IsRootNamespaceBased || n.Config().NetworkMode == "host" {
			hostBasedEndpoints = append(hostBasedEndpoints, n.GetEndpoints()...)
		}
	}

	// collect the endpoints of the fake nodes
	hostBasedEndpoints = append(hostBasedEndpoints, links.GetHostLinkNode().GetEndpoints()...)
	hostBasedEndpoints = append(hostBasedEndpoints, links.GetMgmtBrLinkNode().GetEndpoints()...)

	var joinedErr error
	for _, ep := range hostBasedEndpoints {
		// finally remove all the collected endpoints
		log.Debugf("removing endpoint %s", ep.String())
		err := ep.Remove()
		if err != nil {
			joinedErr = errors.Join(joinedErr, err)
		}
	}

	return joinedErr
}
@hellt (Member) commented:

We need to find a solution for this removed func. While working on the vxlan-stitch variant, I see that when I remove the lab, the host link remains.
This was the reason this func existed.

@steiler (Collaborator, Author) commented:

The Remove() function is implemented on the vxlan link and forwarded to the local endpoint. The local endpoint is actually a generic endpoint. My thinking as of now is that the removal is triggered but executed in the wrong namespace: that of the container instead of the host namespace.
Let me verify.
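A quick way to check that hypothesis is to look the interface up in both namespaces explicitly. A rough sketch, with the netns path and interface name as placeholders:

package main

import (
	"fmt"
	"log"

	"github.com/containernetworking/plugins/pkg/ns"
	"github.com/vishvananda/netlink"
)

// checkWhereLinkLives reports whether a link with the given name exists in the
// host netns and/or in the given container netns.
func checkWhereLinkLives(ifaceName, containerNSPath string) {
	if _, err := netlink.LinkByName(ifaceName); err == nil {
		fmt.Printf("%s found in host netns\n", ifaceName)
	}

	err := ns.WithNetNSPath(containerNSPath, func(_ ns.NetNS) error {
		if _, err := netlink.LinkByName(ifaceName); err == nil {
			fmt.Printf("%s found in container netns\n", ifaceName)
		}
		return nil
	})
	if err != nil {
		log.Printf("failed to enter netns %s: %v", containerNSPath, err)
	}
}

func main() {
	// placeholder link and netns names
	checkWhereLinkLives("eth1", "/run/netns/clab-vxlan-l1")
}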

Comment on lines 78 to 116
	ip := net.ParseIP(lr.Remote)
	// if the returned ip is nil, an error occurred.
	// we consider that we maybe have a textual hostname,
	// e.g. a dns name, so we try to resolve the string next
	if ip == nil {
		ips, err := net.LookupIP(lr.Remote)
		if err != nil {
			return nil, err
		}

		// prepare log message
		sb := strings.Builder{}
		for _, ip := range ips {
			sb.WriteString(", ")
			sb.WriteString(ip.String())
		}
		log.Debugf("looked up hostname %s, received IP addresses [%s]", lr.Remote, sb.String()[2:])

		// always use the first address
		if len(ips) <= 0 {
			return nil, fmt.Errorf("unable to resolve %s", lr.Remote)
		}
		ip = ips[0]
	}

	parentIf := lr.ParentInterface

	if parentIf == "" {
		conn, err := rtnl.Dial(nil)
		if err != nil {
			return nil, fmt.Errorf("can't establish netlink connection: %s", err)
		}
		defer conn.Close()
		r, err := conn.RouteGet(ip)
		if err != nil {
			return nil, fmt.Errorf("failed to find a route to VxLAN remote address %s", ip.String())
		}
		parentIf = r.Interface.Name
	}
@hellt (Member) commented:

@steiler I think this code piece is now shared between
func (lr *LinkVxlanRaw) resolveStitchedVxlanComponent(params *ResolveParams) (*LinkVxlan, error)

and

func (lr *LinkVxlanRaw) resolveVxlan(params *ResolveParams) (Link, error)

This is a good candidate for carving out into common funcs in utils.

@steiler (Collaborator, Author) commented:

No, not in utils, but still in LinkVxlanRaw.
However, there is a slight difference in the vxlan endpoint part.
Anyway, let me see how we can carve the common part out (one possible shape is sketched below).
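One possible shape for the carved-out piece, kept as a method on LinkVxlanRaw; the name resolveRemoteAndParent is hypothetical and the imports are the same as in the snippet quoted above:

// resolveRemoteAndParent resolves lr.Remote to an IP address and, if no parent
// interface was configured, derives it from the host route towards that IP.
// It lifts the logic shared by resolveVxlan and resolveStitchedVxlanComponent.
func (lr *LinkVxlanRaw) resolveRemoteAndParent() (net.IP, string, error) {
	ip := net.ParseIP(lr.Remote)
	if ip == nil {
		// not an IP literal, try to resolve it as a DNS name
		ips, err := net.LookupIP(lr.Remote)
		if err != nil {
			return nil, "", err
		}
		if len(ips) == 0 {
			return nil, "", fmt.Errorf("unable to resolve %s", lr.Remote)
		}
		// always use the first address
		ip = ips[0]
	}

	parentIf := lr.ParentInterface
	if parentIf == "" {
		conn, err := rtnl.Dial(nil)
		if err != nil {
			return nil, "", fmt.Errorf("can't establish netlink connection: %s", err)
		}
		defer conn.Close()

		r, err := conn.RouteGet(ip)
		if err != nil {
			return nil, "", fmt.Errorf("failed to find a route to VxLAN remote address %s", ip.String())
		}
		parentIf = r.Interface.Name
	}

	return ip, parentIf, nil
}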

@steiler (Collaborator, Author) commented:

@hellt, take a look at 8458f0b

Comment on lines 44 to 62
// Deploy provisions the stitched vxlan link with all its underlying sub-links
func (l *VxlanStitched) Deploy(ctx context.Context) error {
	return l.deploy(ctx, false)
}

func (l *VxlanStitched) deploy(ctx context.Context, skipVethCreation bool) error {
	// deploy the vxlan link
	err := l.vxlanLink.Deploy(ctx)
	if err != nil {
		return err
	}

	// the veth creation might be skipped if it already exists
	if !skipVethCreation {
		err = l.vethLink.Deploy(ctx)
		if err != nil {
			return err
		}
	}
@hellt (Member) commented:

@steiler do we need the inner deploy then?

@steiler (Collaborator, Author) commented:

Yes, in my opinion we do.
If you take a look at where deploy is called from, it is:
// DeployWithExistingVeth provisions the stitched vxlan link when the
// veth interface already exists, hence it is not created as part of this
// deployment
func (l *VxlanStitched) DeployWithExistingVeth(ctx context.Context) error {
	return l.deploy(ctx, true)
}

// Deploy provisions the stitched vxlan link with all its underlying sub-links
func (l *VxlanStitched) Deploy(ctx context.Context) error {
	return l.deploy(ctx, false)
}

DeployWithExistingVeth is basically there to support the tools command that we have.

@codecov (bot) commented Oct 6, 2023

Codecov Report

Merging #1532 (6d6ebb8) into main (acba717) will increase coverage by 0.82%.
Report is 8 commits behind head on main.
The diff coverage is 59.35%.

❗ Current head 6d6ebb8 differs from pull request most recent head 19dc2f1. Consider uploading reports for the commit 19dc2f1 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1532      +/-   ##
==========================================
+ Coverage   48.58%   49.41%   +0.82%     
==========================================
  Files         133      136       +3     
  Lines       12798    13231     +433     
==========================================
+ Hits         6218     6538     +320     
- Misses       5874     5961      +87     
- Partials      706      732      +26     
Files Coverage Δ
cmd/destroy.go 69.07% <100.00%> (+2.19%) ⬆️
links/endpoint_bridge.go 48.57% <100.00%> (+6.63%) ⬆️
links/endpoint_host.go 50.00% <100.00%> (+14.28%) ⬆️
links/endpoint_raw.go 78.04% <100.00%> (+5.70%) ⬆️
links/endpoint_veth.go 100.00% <100.00%> (ø)
nodes/bridge/bridge.go 60.97% <100.00%> (ø)
clab/ovs.go 0.00% <0.00%> (ø)
links/endpoint.go 81.25% <88.88%> (+1.58%) ⬆️
clab/tc.go 0.00% <0.00%> (ø)
links/endpoint_vxlan.go 83.33% <83.33%> (ø)
... and 14 more

... and 6 files with indirect coverage changes

@hellt hellt merged commit d38bdf3 into srl-labs:main Oct 6, 2023
19 checks passed

3 participants