NFS client fails to disconnect - node become unresponsive #8376

Open
Daxcor69 opened this issue Feb 27, 2024 · 3 comments

Comments

@Daxcor69

Bug Report

Description

Problem:
Nodes show iowait of 25-45%, with context switching in the 30K range. Customers report painfully slow asset loading.

Symptoms:
During the deletion of a StatefulSet backed by a volume from an external NFS server, the pod remains in a Terminating state. This is NOT using the NFS provisioner (PV/PVC). The only way to remove the pod is kubectl delete pods podname-0 --force --grace-period=0. The pod does get removed.

spec:
  volumes:
    - name: data
      nfs:
        server: nfs1.storage.server.com
        path: /home/pete

During a node reboot these "stuck" processes are listed as "unable to terminate", but the node is eventually rebooted. After the reboot, the iowait and context switching go away.

Theory:
Even though the pod is removed from Kubernetes, the Linux process on the node is never fully terminated. It remains in a state where it thinks it is still waiting on data from the NFS mount, like a really, really big file that never finishes loading. The more of these "zombie" processes there are, the greater the iowait on the node becomes.

Environment

  • Talos version: [talosctl version --nodes <problematic nodes>]
    1.6.0
  • Kubernetes version: [kubectl version --short]
    1.29.0
  • Platform:
    Proxmox VE 8.0
  • NFS server:
    Ubuntu 22.04.04
@smira
Member

smira commented Feb 28, 2024

I wouldn't recommend using NFS today, as it was designed for a totally different use case.

Issues are not expected, though, as long as the NFS server is still responsive. Once the NFS server becomes unresponsive, things go wrong with NFS, which can be partially mitigated with NFS mount options.
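As a rough illustration of what "NFS mount options" could look like in Kubernetes: an inline nfs volume (server/path in the pod spec) has no field for mount options, but a statically defined PersistentVolume does, via mountOptions. The sketch below is an assumption, not something taken from this thread: the object names, the 10Gi size, and the specific option values are hypothetical placeholders, and no provisioner is involved. Note that soft mounts trade hangs for I/O errors, which can cause data corruption for some workloads.

```yaml
# Hypothetical static PV for the export from the issue; no provisioner needed.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs-data-pv              # hypothetical name
spec:
  capacity:
    storage: 10Gi                # placeholder size, required by the API
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: ""           # no storage class, bind statically
  mountOptions:                  # illustrative values only, tune for your setup
    - soft                       # return an I/O error instead of retrying forever
    - timeo=100                  # wait 10 seconds (units are tenths of a second)
    - retrans=3                  # retry 3 times before giving up
  nfs:
    server: nfs1.storage.server.com
    path: /home/pete
---
# Hypothetical claim that binds to the PV above by name.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfs-data-pvc             # hypothetical name
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""
  volumeName: nfs-data-pv
  resources:
    requests:
      storage: 10Gi
```

The pod would then reference the claim through persistentVolumeClaim.claimName instead of the inline nfs block. Whether this actually avoids the stuck processes described above is untested here.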

I'm not quite sure what in this issue can be attributed to Talos Linux, or to anything missing in Talos Linux itself, as NFS is implemented in the kernel, and there's not much we can do on the OS side beyond the things you can configure yourself.

@Daxcor69
Author

When I asked about this in Discord, I got the following message: "there's a problem with NFSv4 due to missing statsd if I remember correctly". Does this mean v4 is not supported in Talos?

I'm just trying to sort it out. I know NFS is not ideal, I get that. Prior to migrating to Talos, NFSv4 worked without the issue I am having now. So is this an issue with NFSv3?

@smira
Member

smira commented Feb 28, 2024

NFSv4 user-space daemons are not enabled, but I believe it won't mount simply with v4.
