NFS client fails to disconnect - node become unresponsive #8376

Daxcor69 · 2024-02-27T17:40:19Z

Bug Report

Description

Problem:
Nodes show iowait of 25-45% with context switching in the 30K range. Customers report painfully slow asset loading.

Symptoms:
During the deletion of a StatefulSet backed by a volume from an external NFS server, the pod remains in the Terminating state. This is NOT using the NFS provisioner (PV/PVC). The only way to remove the pod is kubectl delete pods podname-0 --force --grace-period=0. The pod does then get removed.

spec:
  volumes:
    - name: data
      nfs:
        server: nfs1.storage.server.com
        path: /home/pete

During a node reboot, these "stuck" processes are listed as "unable to terminate", but the node does eventually reboot. After the reboot, the iowait and context switching go away.

Theory:
Even though the pod is removed from Kubernetes, the Linux process on the node is never fully terminated. It stays blocked as if it were still waiting on data from the NFS mount, like a really, really big file that never finishes loading. The more of these "zombie" processes there are, the higher the iowait on the node becomes.
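
One way to check this theory is to look for processes in uninterruptible sleep. A process blocked on NFS I/O normally shows state D in /proc/<pid>/stat rather than Z (a true zombie), and tasks in state D are counted towards iowait. The following is a minimal sketch, not something taken from this report: the function names are made up for illustration, and it assumes it runs somewhere that can see the host's /proc (on Talos, presumably a privileged pod sharing the host PID namespace).

```python
import os


def parse_stat(stat_text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the last ')' instead of splitting on whitespace.
    left, _, right = stat_text.rpartition(")")
    pid_text, _, comm = left.partition("(")
    state = right.split()[0]
    return int(pid_text), comm, state


def read_text(path):
    """Read a small proc file, returning '' if it is gone or unreadable."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return ""


def find_blocked(proc_root="/proc"):
    """List (pid, comm, wchan) for every process in state D."""
    blocked = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "meminfo"
        stat_text = read_text(os.path.join(proc_root, entry, "stat"))
        if not stat_text:
            continue  # process exited between listdir() and the read
        pid, comm, state = parse_stat(stat_text)
        if state == "D":
            # wchan names the kernel function the task is sleeping in,
            # when the kernel exposes it.
            wchan = read_text(os.path.join(proc_root, entry, "wchan"))
            blocked.append((pid, comm, wchan))
    return blocked


if __name__ == "__main__":
    for pid, comm, wchan in find_blocked():
        print(f"{pid}\t{comm}\t{wchan or '?'}")
```

If the same PIDs stay in state D across repeated runs after the pod has been force-deleted, and their wchan points at NFS or RPC wait functions, that would support the theory. A process that only appears in state D briefly is just doing ordinary disk I/O.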

Environment

Talos version (talosctl version --nodes <problematic nodes>): 1.6.0
Kubernetes version (kubectl version --short): 1.29.0
Platform: Proxmox VE 8.0
NFS server: Ubuntu 22.04.04


smira · 2024-02-28T09:28:39Z

I wouldn't recommend using NFS today, as it was designed for a totally different use case.

Issues aren't expected, though, as long as the NFS server is still responsive. Once the NFS server becomes unresponsive, things go wrong with NFS, which can be partially mitigated with NFS mount options.

I'm not quite sure what in this issue can be attributed to Talos Linux, or anything missing in Talos Linux itself, as NFS is implemented in the kernel, and there's not much there we can do on the OS side vs. the things you can configure yourself.

Daxcor69 · 2024-02-28T15:51:43Z

When I asked about this in Discord, I got the following message "there's a problem with NFSv4 due to missing statsd if I remember correctly". Does this mean v4 is not supported in Talos?

So I'm just trying to sort it out. I know NFS is not ideal, I get that. Prior to migrating to Talos, NFSv4 worked without the issue I am having now. So is this an issue with NFSv3?

smira · 2024-02-28T16:38:08Z

NFSv4 user-space daemons are not enabled, but I believe it won't mount simply with v4.
