
fix: btrfs unmounting + e2e tests #879

Merged — 10 commits merged into topolvm:main from btrfs-testing on May 1, 2024

Conversation

@jakobmoellerdev (Contributor) commented Apr 9, 2024

fix #877
partially addresses #876

Adds table-driven tests for the xfs scenarios and runs them against a btrfs StorageClass as well. Also adds a minimum allocation size for btrfs.
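For illustration, a minimal sketch of the table-driven layout using plain Go `testing` (the actual e2e suite uses its own test harness, and the btrfs StorageClass name below is a hypothetical placeholder):

```go
package e2e

import "testing"

// fsTestCase describes one filesystem scenario; the same body runs for every
// entry, so the xfs and btrfs cases share a single table.
type fsTestCase struct {
	name         string
	storageClass string
}

var fsTestCases = []fsTestCase{
	{name: "xfs", storageClass: "topolvm-provisioner"},
	{name: "btrfs", storageClass: "topolvm-provisioner-btrfs"}, // placeholder name
}

func TestFilesystemScenarios(t *testing.T) {
	for _, tc := range fsTestCases {
		tc := tc
		t.Run(tc.name, func(t *testing.T) {
			// Create a PVC and pod with tc.storageClass, verify the mount,
			// resize the volume, and delete the pod; identical for each entry.
			_ = tc
		})
	}
}
```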

@jakobmoellerdev (Contributor Author):

Looking at the test failures, it seems the mount point is not set up as expected with btrfs. Does anyone have any input on that?

@llamerada-jp (Contributor) commented Apr 10, 2024

@cupnes
Is it the same situation as the following issue?
#877

@cupnes (Contributor) commented Apr 10, 2024

@llamerada-jp
I think the situation is different from #877. What is failing now is the pod deployment, whereas #877 is an issue where the pod cannot be deleted.

@jakobmoellerdev (Contributor Author):

@cupnes From what I can see in the test logs, the device verification fails: https://github.com/topolvm/topolvm/actions/runs/8615413549/job/23610904469?pr=879#step:9:570 I'm not sure this is really a failure as early as pod deployment.

@cupnes (Contributor) commented Apr 10, 2024

@jakobmoellerdev Yes. Please wait a moment while I investigate.

@cupnes (Contributor) commented Apr 11, 2024

@jakobmoellerdev
I changed DefaultMinimumAllocationSizeBtrfs to 110MiB and the problem area passed.

From my investigation, it seems that the minimum device size for btrfs changes dynamically [1]. So I do not think it is necessary to enforce a strict minimum allocation size in this PR. We can just note it in limitation.md for now, with another PR to deal with it if necessary.
[1] https://patchwork.kernel.org/project/linux-btrfs/patch/1403148338-6584-1-git-send-email-quwenruo@cn.fujitsu.com/

@jakobmoellerdev (Contributor Author):

I'm not sure I follow this logic. My understanding is that this would make mkfs.btrfs fail. However, it seems the volume is successfully formatted; I would have expected mkfs.btrfs to bail out during MountVolume in the CSI provisioning.

That being said, I'm fine with removing the limit; however, I'm not sure how we would test btrfs then.

@jakobmoellerdev (Contributor Author):

After adjusting the tests, it seems that resizing behaves oddly: the desired/actual storage for btrfs is off after resizing. Any ideas on what causes that?

@cupnes (Contributor) commented Apr 12, 2024

@jakobmoellerdev
There is a slight difference between filesystems in how large the actual filesystem ends up relative to the size requested by the PVC.
The fifth argument of Sprintf in this line was "topolvm-provisioner" yesterday. I think the "btrfs filesystem" test was actually run with xfs, which is why it didn't fail like this before.

@jakobmoellerdev force-pushed the btrfs-testing branch 2 times, most recently from 6f5f7b7 to 8321902 on April 15, 2024 09:41
@jakobmoellerdev (Contributor Author):

It seems that pod deletion for btrfs during offline resizing now has an issue. On top of that, the runtime of the tests is pretty much unbearable. We need to figure out how to fix that, otherwise the tests will run 20+ minutes.

I really think the tests should be refactored...

@jakobmoellerdev (Contributor Author):

It seems NodeUnpublishVolume for btrfs is broken due to "device or resource busy":

 {"level":"info","ts":"2024-04-15T11:41:37Z","logger":"driver.node","msg":"NodeUnpublishVolume called","volume_id":"52e43313-4f6b-4bdf-b3df-973f666ae983","target_path":"/var/lib/kubelet/pods/3fb3f6af-89f0-4fc4-8007-6b8263048f13/volum \u2502
\u2502 es/kubernetes.io~csi/pvc-411430a5-75d7-4aab-ae38-657b31fbe109/mount"}                                                                                                                                                                    \u2502
\u2502 {"level":"error","ts":"2024-04-15T11:41:37Z","msg":"error on grpc call","method":"/csi.v1.Node/NodeUnpublishVolume","error":"rpc error: code = Internal desc = remove dir failed for /var/lib/kubelet/pods/3fb3f6af-89f0-4fc4-8007-6b826 \u2502
\u2502 3048f13/volumes/kubernetes.io~csi/pvc-411430a5-75d7-4aab-ae38-657b31fbe109/mount: error=unlinkat /var/lib/kubelet/pods/3fb3f6af-89f0-4fc4-8007-6b8263048f13/volumes/kubernetes.io~csi/pvc-411430a5-75d7-4aab-ae38-657b31fbe109/mount: de \u2502
\u2502 vice or resource busy","stacktrace":"github.com/topolvm/topolvm/cmd/topolvm-node/app.ErrorLoggingInterceptor\n\t/home/jmoller/topolvm/cmd/topolvm-node/app/run.go:193\ngithub.com/container-storage-interface/spec/lib/go/csi._Node_Node \u2502
\u2502 UnpublishVolume_Handler\n\t/home/jmoller/go/pkg/mod/github.com/container-storage-interface/spec@v1.6.0/lib/go/csi/csi.pb.go:6168\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/home/jmoller/go/pkg/mod/google.golang.org/grpc@v1 \u2502
\u2502 .58.3/server.go:1374\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/home/jmoller/go/pkg/mod/google.golang.org/grpc@v1.58.3/server.go:1751\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.1\n\t/home/jmoller/go/pkg/mod/google. \u2502
\u2502 golang.org/grpc@v1.58.3/server.go:986"}

@jakobmoellerdev (Contributor Author):

It seems the problem here is that the umount call does not go through. When I debug into the node and run umount /var/lib/kubelet/pods/3fb3f6af-89f0-4fc4-8007-6b8263048f13/volumes/kubernetes.io~csi/pvc-411430a5-75d7-4aab-ae38-657b31fbe109/mount manually, the test recovers.

@jakobmoellerdev (Contributor Author) commented Apr 15, 2024

It seems we have a device-mapper / mounting issue again. The reason this is not recovering is that for btrfs the following /proc/mounts entry exists and makes this check think we do not have a mount:

/dev/dm-3 /var/lib/kubelet/pods/2be71706-9a37-41fe-91c2-f21220f5b1be/volumes/kubernetes.io~csi/pvc-411430a5-75d7-4aab-ae38-657b31fbe109/mount btrfs rw,relatime,discard=async,space_cache=v2,subvolid=5,subvol=/ 0 0

As you can see, the device appears as /dev/dm-3; I assume this is the same root cause as #877.
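For illustration, a minimal stdlib-only sketch (not the PR's code) of why a plain string comparison against /proc/mounts can miss the mount: TopoLVM knows the LV by its /dev/mapper symlink, while /proc/mounts records the resolved /dev/dm-N name, so resolving symlinks before comparing (or comparing device numbers) avoids the false negative. The volume group and LV names below are placeholders.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// sameBlockDevice reports whether two device paths resolve to the same node,
// e.g. the /dev/mapper/<vg>-<lv> symlink and the /dev/dm-N name it points to.
func sameBlockDevice(a, b string) (bool, error) {
	ra, err := filepath.EvalSymlinks(a)
	if err != nil {
		return false, err
	}
	rb, err := filepath.EvalSymlinks(b)
	if err != nil {
		return false, err
	}
	return ra == rb, nil
}

func main() {
	// Placeholder names: on most systems /dev/mapper/myvg-myvol is a symlink
	// to ../dm-N, so this returns true even though the strings differ.
	ok, err := sameBlockDevice("/dev/mapper/myvg-myvol", "/dev/dm-3")
	fmt.Println(ok, err)
}
```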

@jakobmoellerdev (Contributor Author):

@toshipp @cupnes In light of my recent investigation, I found that we are still using /proc/mounts instead of /proc/1/mountinfo (which would also include the major/minor numbers and some other useful features). Also, when looking through the code, I do not see why we need our own unmount logic when there is a well-written unmounter in mount-utils. I'm sure there was a reason, but could you explain it to me again? If we don't really have any reason to keep this around, I would vote for switching to the upstream unmounting process.
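A hedged sketch of what delegating to the upstream helper could look like, assuming k8s.io/mount-utils and its CleanupMountPoint helper; this illustrates the suggestion rather than the code that landed in the PR.

```go
package main

import (
	"log"

	mount "k8s.io/mount-utils"
)

// unpublish unmounts targetPath if it is a mount point and removes the
// directory, using the upstream helper instead of custom unmount logic.
func unpublish(targetPath string) error {
	mounter := mount.New("") // default mount/umount binaries
	// The last argument selects the more expensive mount-point check that also
	// handles bind mounts; false keeps the cheaper heuristic.
	return mount.CleanupMountPoint(targetPath, mounter, false)
}

func main() {
	if err := unpublish("/mnt/example-target"); err != nil {
		log.Fatal(err)
	}
}
```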

@toshipp (Contributor) commented Apr 18, 2024

@jakobmoellerdev
mount-utils had not yet been exported during the early development stage of TopoLVM. We later replaced our mount logic with it but missed the unmount part, so I have no objection to using it.

I checked mountinfo and found that the major number is not accurate for LVM devices: it reports 0 but should be 253.
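For illustration, a small stdlib-only sketch (not part of the PR) that looks up the major:minor recorded in /proc/1/mountinfo for a given mount point, which is how the observation above can be reproduced; the target path is a placeholder.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	target := "/mnt/example-target" // placeholder mount point to look up

	f, err := os.Open("/proc/1/mountinfo")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	defer f.Close()

	sc := bufio.NewScanner(f)
	for sc.Scan() {
		// mountinfo fields: mountID parentID major:minor root mountPoint options ...
		fields := strings.Fields(sc.Text())
		if len(fields) >= 5 && fields[4] == target {
			fmt.Printf("%s is mounted from device %s\n", target, fields[2])
		}
	}
}
```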

@cupnes (Contributor) left a review:

I will comment on the points I noticed. The review is not yet complete and will continue tomorrow.

Review comments on internal/driver/node.go and internal/filesystem/util.go (resolved).
@jakobmoellerdev force-pushed the btrfs-testing branch 3 times, most recently from 19be4e0 to f6f4b9d on April 22, 2024 17:05
@jakobmoellerdev (Contributor Author):

I was able to reduce the test runtime per filesystem by about 2 minutes, thanks to some missing --now flags when deleting pods, which had caused a 30-second grace-period wait. The reason the resize tests still take quite a bit of time is actually related to the check with df in the pod. TBH I have no idea why it's taking so long and I will have to do some tracing to see where it is slow, but I'm assuming it has to do with the asynchronous nature of CSI, because the volume expansion on LVM itself was extremely quick.
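For illustration, a hypothetical helper showing the effect of the flag: kubectl delete --now requests immediate shutdown (a one-second grace period) instead of waiting out the default 30-second termination grace period for each test pod.

```go
package main

import (
	"fmt"
	"os/exec"
)

// deletePodNow deletes a pod with --now so the test does not wait out the
// default termination grace period.
func deletePodNow(namespace, name string) error {
	out, err := exec.Command("kubectl", "-n", namespace, "delete", "pod", name, "--now").CombinedOutput()
	if err != nil {
		return fmt.Errorf("kubectl delete pod %s failed: %w: %s", name, err, string(out))
	}
	return nil
}

func main() {
	if err := deletePodNow("default", "example-pod"); err != nil {
		fmt.Println(err)
	}
}
```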

@cupnes (Contributor) left a review:

Thank you. LGTM

@jakobmoellerdev (Contributor Author):

I also want to give an update on my long-running but interesting investigation into the long duration of the resizing and its impact on our test suite (6+ minutes).

This is in fact caused by ControllerExpandVolume always reporting that node expansion is needed:

NodeExpansionRequired: true,

Why does this matter? Because NodeExpandVolume is not called immediately; instead it is queued by the kubelet's volume manager. When investigating, I found that we are in fact not delaying the call in TopoLVM at all; rather, the kubelet does not call NodeExpandVolume for minutes, causing the test to stall until the kubelet issues the CSI call. Digging deeper, I found that https://github.com/kubernetes/enhancements/blob/0e4d5df19d396511fe41ed0860b0ab9b96f46a2d/keps/sig-storage/556-csi-volume-resizing/README.md#expansion-on-kubelet is the design reason for this. Note, however, that in theory the volume manager should reconcile every 100ms, which for some reason does not happen here. I know this is where it gets stuck because of the PVC's conditions while waiting in the df Eventually check:

  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2024-04-23T14:45:33Z"
    message: Waiting for user to (re-)start a pod to finish file system resize of
      volume on node.
    status: "True"
    type: FileSystemResizePending

I still need to verify whether there is an issue in the kubelet that causes delays we don't want, but this is definitely the reason we see the delay. Nothing in TopoLVM is doing any work while we are in this state, so there is nothing we can do to speed it up.

How could we fix this without fixing the kubelet delay? We always require NodeExpandVolume because it is the only code path where we resize the filesystem. This is because we do not implement NodeStageVolume and NodeUnstageVolume, which could handle mounting and unmounting separately from NodePublishVolume/NodeUnpublishVolume, which only run when the pod is scheduled.

This is important because changing it would allow us to run the filesystem resize in the LV controller's resize method without setting NodeExpansionRequired to true, and thus significantly speed up mounting and resizing behavior.
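A simplified sketch of the trade-off (not the driver's actual code), assuming the standard CSI Go bindings: with NodeExpansionRequired true, today's behavior, the kubelet must still schedule NodeExpandVolume, which is where the wait appears; if the filesystem were grown as part of the LV resize, the response could report false and the kubelet round-trip would be skipped.

```go
package main

import (
	"github.com/container-storage-interface/spec/lib/go/csi"
)

// expandResponse builds a ControllerExpandVolume response after the logical
// volume has been resized to newSizeBytes.
func expandResponse(newSizeBytes int64, fsAlreadyGrown bool) *csi.ControllerExpandVolumeResponse {
	return &csi.ControllerExpandVolumeResponse{
		CapacityBytes: newSizeBytes,
		// true  -> the kubelet must still call NodeExpandVolume to grow the filesystem
		// false -> no node-side expansion step is needed
		NodeExpansionRequired: !fsAlreadyGrown,
	}
}

func main() {
	_ = expandResponse(2<<30, false)
}
```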

@cupnes (Contributor) commented Apr 24, 2024

@jakobmoellerdev
Understood. But I was wondering: is that correct according to the CSI specification? I think NodeStageVolume and NodeUnstageVolume are intended for mounting and unmounting volumes.

@jakobmoellerdev (Contributor Author):

I think the main purpose of NodeStageVolume is to prepare the volume for use by pods. There is no mention of it being intended just for mount/umount in https://github.com/container-storage-interface/spec/blob/master/spec.md#nodestagevolume, as far as I know. I've also seen drivers where Stage/Unstage does the necessary work, such as activating volumes, and Publish then does the actual mount as read-only or read-write.

@jakobmoellerdev (Contributor Author):

Okay, this investigation aside, I believe this PR can be merged without issues. @llamerada-jp, any concerns?

@cupnes (Contributor) commented Apr 25, 2024

@jakobmoellerdev
On the other hand, I think it would be good to report to the Kubernetes project that the kubelet does not call NodeExpandVolume for minutes.

@jakobmoellerdev (Contributor Author):

Agreed, I will make sure to raise this with the Kubernetes community. As a follow-up, in https://github.com/topolvm/topolvm/actions/runs/8829012081/job/24239136537?pr=898#step:9:621 you can see that when I implement a controller-based resize without NodeExpandVolume, the test is as fast as the others, further supporting my findings.

@llamerada-jp (Contributor):

Sorry for the delay in reviewing this; I have been busy with other tasks. It looks good for the most part, and I'm still checking the details.

@llamerada-jp merged commit 9a916b2 into topolvm:main on May 1, 2024. 24 checks passed.
@jakobmoellerdev (Contributor Author):

I just noticed this got merged without a squash. Maybe we should introduce a GitHub Action to block merges unless there is a single squashed commit?


Successfully merging this pull request may close these issues.

topolvm-node does not unmount btrfs volumes after running parted -l