Bug Report
Running a dual stack cluster with Talos is still problematic, and the documentation is still sparse.
Description
The last bug report I created turned into a nice opportunity for many to contribute ideas on how to achieve dual stacking with Talos. Thanks to everyone who contributed to that one.
Still, there are more limitations, so I am creating this new issue in the hope that it will also help improve the documentation and push development towards a better solution for dual stacking.
A-Talos's VIP is single stack
In the machine config, you can create a VIP that is shared by the control planes to avoid the need for an external load balancer. Unfortunately, that option is single stack only: if you put both an IPv4 and an IPv6 address, the config fails to parse.
machine:
  network:
    interfaces:
      - interface: eth0
        vip:
          ip: 1.2.3.4,2001:0DB8:1234:56::30
Fails because the value does not parse as a valid IP address.
machine:
  network:
    interfaces:
      - interface: eth0
        vip:
          ip: 1.2.3.4
          ip: 2001:0DB8:1234:56::30
Fails because there are two IP lines.
machine:
  network:
    interfaces:
      - interface: eth0
        vip:
          ip: 2001:0DB8:1234:56::30
or
machine:
  network:
    interfaces:
      - interface: eth0
        vip:
          ip: 1.2.3.4
Either of these will work, but the VIP will be single stack.
Another point here is that the Talos docs recommend not using that VIP as the endpoint for talosctl. So I created a DNS name that resolves to each of the control planes' IPs. Unfortunately, should the first address fail, the command exits and does not try the others. Since that mechanism fails to move even from one IPv4 address to another, it will of course also fail to move from an IPv4 address to an IPv6 one. So here again, Talos is single stack only.
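As a side note, a possible workaround I have not fully verified (so take it as a sketch, not a confirmed fix) would be to list every control plane address explicitly as endpoints in the talosconfig instead of relying on a single DNS name, and let talosctl's own endpoint handling pick one; whether that behaves any better across mixed v4/v6 endpoints is exactly the kind of thing that should be documented. The context name below is mine:
context: kube64
contexts:
  kube64:
    endpoints:
      # every control plane, both families, instead of one round-robin DNS name
      - 172.24.136.161
      - 172.24.136.162
      - 172.24.136.163
      - 2001:0DB8:1234:c0::31
      - 2001:0DB8:1234:c0::32
      - 2001:0DB8:1234:c0::33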
B-Dual stack IPv6 / IPv4 is not the same as dual stack IPv4 / IPv6
In this post, @bernardgut explained that the ordering of the IP ranges in the config is important. If the order does not match from one setting to another, the config will fail. It is also important to understand, explain and document that if IPv4 is listed first, the dual stack cluster will be IPv4 / IPv6, and if IPv6 is listed first, the cluster will be IPv6 / IPv4.
Here, I am deploying Longhorn as a CSI. It does not support IPv6 and its frontend service must be IPv4. Yet its Helm chart does not specify an IP family, so the service ends up single stack using the cluster's first IP family. For that reason, you will not reach Longhorn's UI if you are single stack v6 or dual stack v6 / v4. Again, nowhere did I see any documentation about this kind of impact.
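For reference, Kubernetes itself lets a Service be pinned to one family regardless of the cluster's ordering; below is a minimal sketch of the fields that matter if one were to override Longhorn's frontend that way (longhorn-frontend / longhorn-system are the chart's usual defaults, and whether the chart actually exposes these fields is an assumption on my part):
apiVersion: v1
kind: Service
metadata:
  name: longhorn-frontend      # assumed default name from the Longhorn chart
  namespace: longhorn-system   # assumed default namespace
spec:
  ipFamilyPolicy: SingleStack  # keep the Service single stack...
  ipFamilies:
    - IPv4                     # ...but pinned to IPv4, whatever the cluster's first family is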
C-And some other minor points...
As for the service subnet, I am the one who suspected that /112 was the largest usable prefix, but indeed /108 is fine and I am using it myself. Thanks to @nazarewk for that one, and forget about my /112.
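If I read the apiserver's allocation limit correctly (my own interpretation, not something I found documented for this), the service CIDR is capped at 20 host bits per family, which is why /108 mirrors the /12 I use for IPv4 while my old /112 was simply smaller than it needed to be:
 32 - 12  = 20 host bits -> 2^20 = 1,048,576 service IPs (10.96.0.0/12)
128 - 108 = 20 host bits -> 2^20 = 1,048,576 service IPs (a /108)
128 - 112 = 16 host bits -> 2^16 =    65,536 service IPs (my old /112)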
Because I ended up forced to run IPv4 / IPv6 to accommodate Longhorn, I chose to re-enable KubePrism. The cluster being IPv4 first, I did not detect any problem with it. Still, I cannot prove that it fully works as expected, or whether it just ends up surviving like Longhorn's frontend, saved by the fact that the cluster is IPv4 first.
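For completeness, this is the knob I mean by re-enabling it; a minimal sketch, with 7445 being the default port documented by Talos:
machine:
  features:
    kubePrism:
      enabled: true
      port: 7445   # default port; the local API server load balancer listens on localhost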
So indeed, there are ways to run a dual stack cluster with Talos, but there are still many things that are obscure, uncertain or even confirmed as non-functional.
Overall, I am using these files:
machine:
  install:
    extraKernelArgs:
      - net.ifnames=0
  sysctls:
    vm.nr_hugepages: "2048"
  time:
    disabled: false # Indicates if the time service is disabled for the machine.
    servers:
      - time.localdomain
    bootTimeout: 2m0s # Specifies the timeout when the node time is considered to be in sync unlocking the boot sequence.
  kubelet:
    nodeIP:
      validSubnets:
        - 172.24.136.128/26
        - 2001:0DB8:1234:c0::/64
cluster:
  apiServer:
    extraArgs:
      bind-address: "::"
  controllerManager:
    extraArgs:
      bind-address: "::1"
      node-cidr-mask-size-ipv6: "80"
  scheduler:
    extraArgs:
      bind-address: "::1"
  network:
    podSubnets:
      - 10.244.0.0/16
      - 2001:0DB8:1234:c1::/64
    serviceSubnets:
      - 10.96.0.0/12
      - 2001:0DB8:1234:c3::10:0/108
  etcd:
    advertisedSubnets:
      - 172.24.136.128/26
      - 2001:0DB8:1234:c0::/64
  proxy:
    extraArgs:
      ipvs-strict-arp: true
For the control planes, I add this:
machine:
  network:
    interfaces:
      - interface: eth0
        vip:
          ip: 2001:0DB8:1234:c0::30
  certSANs:
    - k64ctl.localdomain
    - kube64-ctl.localdomain
    - kube64-c1.localdomain
    - kube64-c2.localdomain
    - kube64-c3.localdomain
    - 172.24.136.161
    - 172.24.136.162
    - 172.24.136.163
    - 2001:0DB8:1234:c0::30
    - 2001:0DB8:1234:c0::31
    - 2001:0DB8:1234:c0::32
    - 2001:0DB8:1234:c0::33
    - ::1
Each control plane also has its own unique file:
machine:
  network:
    hostname: kube64-c1.localdomain
    interfaces:
      - interface: eth0
        addresses:
          - 172.24.136.161/26
          - 2001:0DB8:1234:c0::31/64
        routes:
          - network: 0.0.0.0/0
            gateway: 172.24.136.129
    nameservers:
      - 172.24.136.132
      - 172.24.136.135
just like each worker:
machine:
  disks:
    - device: /dev/sdb
      partitions:
        - mountpoint: /var/mnt/sdb
  kubelet:
    extraMounts:
      - destination: /var/mnt/sdb
        type: bind
        source: /var/mnt/sdb
        options:
          - bind
          - rshared
          - rw
  network:
    hostname: kube64-w1.localdomain
    interfaces:
      - interface: eth0
        addresses:
          - 172.24.136.164/26
          - 2001:0DB8:1234:c0::41/64
        routes:
          - network: 0.0.0.0/0
            gateway: 172.24.136.129
    nameservers:
      - 172.24.136.132
      - 172.24.136.135
Logs
Environment
- Talos version: 1.9.4
- Kubernetes version: 1.32.2
- Platform: VMs built from ISO running in Proxmox 8.3