Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Introduce minimum PVC allocation sizing by creating minimum allocation settings #851

Merged
merged 1 commit into from
Apr 4, 2024

Conversation

jakobmoellerdev
Copy link
Contributor

@jakobmoellerdev jakobmoellerdev commented Mar 6, 2024

This PR introduces a minimum size allocation for any PVCs that are requested. It also refactors the test cases to be table driven so that its easier to modify and see whats wrong.

The limits introduced are the following:

const (
	// DefaultMinimumAllocationSizeBlock is the default minimum size for a block volume.
	// Derived from the usual physical extent size of 4Mi * 2 (for accommodating metadata)
	DefaultMinimumAllocationSizeBlock = "8Mi"
	// DefaultMinimumAllocationSizeXFS is the default minimum size for a filesystem volume with XFS formatting.
	// Derived from the hard XFS minimum size of 300Mi that is enforced by the XFS filesystem.
	DefaultMinimumAllocationSizeXFS = "300Mi"
	// DefaultMinimumAllocationSizeExt4 is the default minimum size for a filesystem volume with ext4 formatting.
	// Derived from the usual 4096K blocks, 1024 inode default and journaling overhead,
	// Allows for more than 80% free space after formatting, anything lower significantly reduces this percentage.
	DefaultMinimumAllocationSizeExt4 = "32Mi"
)

Why?
Since we removed the 1Gi Allocation we should set sensible defaults that allow users to still make use of our rounding but that do not go so low as to break the Mounting procedure.

What was changed?

The following new controller flags were introduced with the above defaults:

  • minimum-allocation-block
  • minimum-allocation-xfs
  • minimum-allocation-ext4

If not set, they use the defaults so the HELM chart does not need adjusting.
Note that setting 0 or negative values in these configurations disables the allocation sizing for that configuration.

How to reproduce?
Create a PVC with something like

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: simple-pvc
spec:
  storageClassName: sg1 # should be with default or explicit xfs setting
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Mi

One will notice the following error during mounting

Events:
  Type     Reason       Age                 From               Message
  ----     ------       ----                ----               -------
  Normal   Scheduled    2m4s                default-scheduler  Successfully assigned default/test-pod to cnv-qe-infra-23.cnvqe2.lab.eng.rdu2.redhat.com
  Warning  FailedMount  59s (x8 over 2m4s)  kubelet            MountVolume.SetUp failed for volume "pvc-1bba5d94-aa42-4f8e-a75e-fe16debdbd9a" : rpc error: code = Internal desc = mount failed: volume=17c48e86-3a9b-4ef0-83c1-698e0d6fd7b5, error=format of disk "/dev/topolvm/17c48e86-3a9b-4ef0-83c1-698e0d6fd7b5" failed: type:("xfs") target:("/var/lib/kubelet/pods/8b4ae8e7-55c0-45d5-805b-02496d66b782/volumes/kubernetes.io~csi/pvc-1bba5d94-aa42-4f8e-a75e-fe16debdbd9a/mount") options:("nouuid,defaults") errcode:(exit status 1) output:(size 3072 of data subvolume is too small, minimum 4096 blocks
....  (cut for brevity)

Additionally if one then increases that block amount to lets say 16 MiB one will eventually face (with a recent enough version of mkfs):

Events:
  Type     Reason       Age                 From               Message
  ----     ------       ----                ----               -------
  Normal   Scheduled    2m4s                default-scheduler  Successfully assigned default/test-pod to cnv-qe-infra-23.cnvqe2.lab.eng.rdu2.redhat.com
  Warning  FailedMount  59s (x8 over 2m4s)  kubelet            MountVolume.SetUp failed for volume "pvc-1bba5d94-aa42-4f8e-a75e-fe16debdbd9a" : rpc error: code = Internal desc = mount failed: volume=17c48e86-3a9b-4ef0-83c1-698e0d6fd7b5, error=format of disk "/dev/topolvm/17c48e86-3a9b-4ef0-83c1-698e0d6fd7b5" failed: type:("xfs") target:("/var/lib/kubelet/pods/8b4ae8e7-55c0-45d5-805b-02496d66b782/volumes/kubernetes.io~csi/pvc-1bba5d94-aa42-4f8e-a75e-fe16debdbd9a/mount") options:("nouuid,defaults") errcode:(exit status 1) output:(Filesystem must be larger than 300MB.)
....  (cut for brevity)

due to the change here: https://www.spinics.net/lists/linux-xfs/msg63099.html & https://www.spinics.net/lists/linux-xfs/msg63831.html

Copy link

@akalenyu akalenyu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, I am probably just misunderstanding things, and a small naming suggestion

internal/driver/controller.go Outdated Show resolved Hide resolved
internal/driver/controller.go Outdated Show resolved Hide resolved
@satoru-takeuchi
Copy link
Member

IMO, this kind of problem should be handled in the filesystem layer. Since this problem seems not to happen in the latest mkfs.xfs, I'd like to keep the current code as simple as possibile than applying this PR. @llamerada-jp How do you think?

@jakobmoellerdev
Copy link
Contributor Author

Just to make clear that this will happen in the latest versions of mkfs with 300 MiB. Before it would happen with 16 MiB and the Filesystem would be almost unusable. I can also introduce an E2E case to confirm this

@jakobmoellerdev
Copy link
Contributor Author

Also could you kindly tell what you mean with the "filesystem layer"? AFAIK the only alternative to the rounding would be to resize the volume during the mount procedure.

A Pod using a PVC with the StorageClass is completely unaware of this limitation.

@llamerada-jp
Copy link
Contributor

Originally, the minimum allocation size was 1 GiB. Therefore, I do not object to setting a new minimum allocation. However, we may avoid hard-coding these numbers. Some sites may not like to be allocated without their permission, and other file systems may have similar limitations. I do not want to make similar changes every time that happens. Therefore, how about specifying this number in the argument of the program or storage class? Is this an excessive function?

@jakobmoellerdev
Copy link
Contributor Author

jakobmoellerdev commented Mar 7, 2024

@llamerada-jp I am not opposed to that idea, I think this is a good point. My only question would be is how we should set this up. From what I understand the only way to give dynamic sizing configuration would be either through a new set of flags/env variables, or through a new custom configuration in lvmd (in which case however we wouldn't work through the CSI spec). We are currently missing detailed ways to configure sizing limitations in the gRPC controller logic.

I am open to ideas here. I would like to keep the configuration ability to configure not only flat limits, but also per VolumeMode/Type for specific Filesystems.

I am also open to discuss a Design Doc and am willing to take the implementation, just need to know what preferences we have here.

EDIT: We could start introducing a set of fields here:
https://kubernetes-csi.github.io/docs/external-provisioner.html?highlight=parameters#storageclass-parameters

with a pattern like size.minimum.topolvm.io/xfs=300Mi and then use this as a dynamic logic.

we could set

  • size.minimum.topolvm.io/xfs
  • size.minimum.topolvm.io/ext4
  • size.minimum.topolvm.io/block

@llamerada-jp
Copy link
Contributor

I am open to ideas here. I would like to keep the configuration ability to configure not only flat limits, but also per VolumeMode/Type for specific Filesystems.

It seems to be better to set it to storage class. Since fstype has already been set in the storage class, I think it is enough to add the storage class parameter as follows.

parameters:
  csi.storage.k8s.io/fstype: xfs
  topolvm.io/minimum-allocate(or minimum-size?): 300Mi

@jakobmoellerdev
Copy link
Contributor Author

Good point. Let me accomodate this change so that we can use it dynamically.

@llamerada-jp
Copy link
Contributor

I found a little problem when using the StorageClass to set new configurations: we cannot edit the existing SC. we have to recreate the SC with the same name for adding the configuration. I believe it is only a small limitation since we can continuously use existing PV/PVC.

@jakobmoellerdev
Copy link
Contributor Author

I think this tradeoff needs to be made clear in the docs for downstreams like OpenShift, but for TopoLVM simply documenting this should be fine.

@jakobmoellerdev
Copy link
Contributor Author

I have a patch incoming for this PR but I don't have the E2E test yet, will take a bit more time

@satoru-takeuchi
Copy link
Member

@jakobmoellerdev

Also could you kindly tell what you mean with the "filesystem layer"?

Oops, sorry. I meant filesystem "tool" layer, in this case, mkfs.xfs. In addition, I misunderstood the purpose of this PR. I considered that it is to stop the creation of too small volume.

I agree with adding a new parameter to SC as Yuji said.

I have a patch incoming for this PR but I don't have the E2E test yet, will take a bit more time

I'll review the updated version, thanks.

@jakobmoellerdev jakobmoellerdev force-pushed the xfs-minimum-size branch 2 times, most recently from b02cb17 to bb6bf8c Compare March 11, 2024 17:23
@jakobmoellerdev jakobmoellerdev changed the title fix: Introduce a minimum size for XFS formatted File System Volumes fix: Introduce a minimum size for XFS formatted file system volumes by creating a topolvm.io/minimum-allocated-size StorageClass parameter Mar 11, 2024
@jakobmoellerdev
Copy link
Contributor Author

Note to reviewers: Added the limitation generically since users might also want minimum sizes for block volumes

constants.go Outdated Show resolved Hide resolved
internal/driver/controller.go Outdated Show resolved Hide resolved
constants.go Outdated Show resolved Hide resolved
@jakobmoellerdev jakobmoellerdev changed the title fix: Introduce a minimum size for XFS formatted file system volumes by creating a topolvm.io/minimum-allocated-size StorageClass parameter feat: Introduce a minimum size for XFS formatted file system volumes by creating a topolvm.io/minimum-allocated-size StorageClass parameter Mar 12, 2024
@jakobmoellerdev
Copy link
Contributor Author

@llamerada-jp @satoru-takeuchi I have switched to using minimum-allocated-size-filesystem and minimum-allocated-size-block because the same storageclass might be used for both VolumeMode Filesystem and Block but the user might only want to limit Filesystem Volumes, not block storage. Thus we need to have these 2 flags. I also added some unit tests and made it so that negative capacities would result in 0 minimum. Please have another look

@jakobmoellerdev jakobmoellerdev force-pushed the xfs-minimum-size branch 2 times, most recently from 9ef2e07 to d5540b2 Compare March 12, 2024 16:34
@jakobmoellerdev
Copy link
Contributor Author

I'd like to know about "a separate set of minimums" in detail. I wonder why minimum allocation size might be changed for each deviceClass. In my understanding, the minimum allocation sizes are only for avoiding fs mount failure. So I still consider that it's a global configuration exists for each fs. Could you tell me a specific case where the minimum size differs from the global value (e.g. 300 MiB for XFS)? To be honest, I can't imagine such cases probably because I am not familiar with edge computing.

I think you are referring to the fact that xfs configuration would be global. And that is true for one node. However if we have 2 deviceClasses for 2 different nodes or nodeSets, then this changes. If Node A has a different fs default than node B then I would have to use 2 different controller configurations.

However I did notice that I did not consider that most users who want custom formatting could just use block devices and format the devices themselves. A common way to optimize formatting of ext4 for example is through disabling certain options that are not necessary on multi-gig devices: ^has_journal,^uninit_bg,^ext_attr,^huge_file,^64bit. This means that technically only deviceClass specifics for block devices would be necessary. And even I consider that a big edge case

I guess what you really want to accomplish at long last is setting not only minimum size but also many fs parameters like agsize and maxpct for each device class. Does it correct?

Yes that is correct. But since TopoLVM cannot do this right now the workaround for users is to create block devices and do fs mount/unmount themselves

Introducing a new ConfigMap only for testing is too much.

The ConfigMap (at least in my opinion) is a much better way to configure an application than using CLI flags for everything. This is why almost every major solution at some point starts offering filebased configuration instead of args, because args are hard to document and control/audit. But I see that we have trouble arguing for the value add here as well so let me push that to a separate PR so as not to block this effort here further.

With the above said:

  1. Lets just add global defaults for now without config file like you suggested to appeal to your request to scope down the PR
  2. Lets talk about custom fs options+minimum sizes in a separate issue?

@jakobmoellerdev jakobmoellerdev force-pushed the xfs-minimum-size branch 4 times, most recently from 024159c to 0dd1cf2 Compare March 22, 2024 11:15
@llamerada-jp
Copy link
Contributor

I was focusing on only avoiding errors in mkfs by adding a minimum allocation size on this PR. Therefore, I did not consider extensions about other options for mkfs. There may have been a misunderstanding there.

  1. Lets just add global defaults for now without config file like you suggested to appeal to your request to scope down the PR
  2. Lets talk about custom fs options+minimum sizes in a separate issue?

I agree with this suggestion. Since the design of mkfs options other than minimum allocation size will be complicated by referring to your examples, it would be better to make a proposal in a separate PR and discuss it once. I believe that the default values we are adding in this PR and the new options we are going to consider in the proposal are not in conflict. @satoru-takeuchi How do you think?

@satoru-takeuchi
Copy link
Member

Sorry for the late reply. I was on paid leave yesterday and the day before yesterday.

@llamerada-jp I agree with you.

@jakobmoellerdev Please update this PR in accordance with the following description.

With the above said:

  1. Lets just add global defaults for now without config file like you suggested to appeal to your request to scope down the PR
  2. Lets talk about custom fs options+minimum sizes in a separate issue?

FYI, in general, I agree with the following your opinion.

The ConfigMap (at least in my opinion) is a much better way to configure an application than using CLI flags for everything. This is why almost every major solution at some point starts offering filebased configuration instead of args, because args are hard to document and control/audit. But I see that we have trouble arguing for the value add here as well so let me push that to a separate PR so as not to block this effort here further.

We afraid to introduce ConfigMap just for edge cases. I'm glad if the design proposal of "ConfigMap-based custom fs options+minimum sizes" is also beneficial for non-edge users.

@jakobmoellerdev jakobmoellerdev force-pushed the xfs-minimum-size branch 2 times, most recently from adfea09 to f513745 Compare March 27, 2024 09:26
@jakobmoellerdev
Copy link
Contributor Author

Adjusted as discussed, PTAL again before I squash the commits. Thanks for the feedback!

@satoru-takeuchi
Copy link
Member

I'll finish reviewing this PR tomorro or the next to tomorror.

@llamerada-jp
Copy link
Contributor

@jakobmoellerdev @satoru-takeuchi
I would like to ask how many default values should be used for ext4 and block. From my study I have found the following. I think 128 KiB or 4MiB would be better, how do you think?

  • Specifying 1 KiB for PVC does not cause an error. However, 4MiB was allocated in my environment. This is because the size of LVM PE was 4MiB in my environment.
  • The minimum PE size is 128 KiB, but it seems that 4 MiB is set in most Linux environments.

man vgcreate

     -s|--physicalextentsize Size[m|UNIT]
          Sets the physical extent size of PVs in the VG.  The value must be either a power of 2 of at least 1 sector (where
          the sector size is the largest sector size of the PVs currently used in the VG), or at least 128KiB.  Once this
          value has been set, it is difficult to change without recreating the VG, unless no extents need moving.

/etc/lvm/lvm.conf (ubuntu 22.04)

    # Configuration option allocation/physical_extent_size.
    # Default physical extent size in KiB to use for new VGs.
    # This configuration option has an automatic default value.
    # physical_extent_size = 4096

@satoru-takeuchi
Copy link
Member

@llamerada-jp IMO, 4MiB is enough.

@jakobmoellerdev
Copy link
Contributor Author

I believe we are not taking into consideration the usual default value of pv_min_size:

	# Configuration option devices/pv_min_size.
	# Minimum size in KiB of block devices which can be used as PVs.
	# In a clustered environment all nodes must use the same value.
	# Any value smaller than 512KiB is ignored. The previous built-in
	# value was 512.
	# This configuration option has an automatic default value.
	# pv_min_size = 2048

This would lead to a minimum of at least 512KiB, but effectively 2MiB for most environments. Also together with what @llamerada-jp said, we need at least a full extent size of 4MiB to cover most default lvm installations.

Additionally when trying with a device of exactly 4MiBs we will receive

root@mainframe:~# lsblk
NAME        MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINTS
loop0         7:0    0     4M  0 loop 

lvcreate -L4MiB testvg -n testlv
  Volume group "testvg" has insufficient free space (0 extents): 1 required.

Even if we increase it, ext4 will use up quite a bit of space for metadata:

sblk
NAME            MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINTS
loop0             7:0    0     8M  0 loop 

lvs
  LV     VG     Attr       LSize Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  testlv testvg -wi-a----- 4.00m

mkfs.ext4 /dev/testvg/testlv
mke2fs 1.47.0 (5-Feb-2023)
Discarding device blocks: done                            
Creating filesystem with 4096 1k blocks and 1024 inodes

Allocating group tables: done                            
Writing inode tables: done                            
Creating journal (1024 blocks): done
Writing superblocks and filesystem accounting information: done

df -H
Filesystem                 Size  Used Avail Use% Mounted on
/dev/mapper/testvg-testlv  2.9M   48k  2.6M   2% /mnt/testlv

As you can see, for a 8MiB lv formatted on ext4 with default settings on Fedora 39, we only have a size of 2.9M with a usable size of 2.6M (due to FS metadata).

Thus I recommend at least a minimum of 16-32MiB to have actually the amount available for use that was requested. Going any lower than that will not expose the requested size as available storage, but a smaller percentage.

@satoru-takeuchi
Copy link
Member

@jakobmoellerdev Thank you for your detailed explanation.

Thus I recommend at least a minimum of 16-32MiB to have actually the amount available for use that was requested.

OK. Then let's make it 32 MiB for now.

@jakobmoellerdev jakobmoellerdev force-pushed the xfs-minimum-size branch 2 times, most recently from 62d2b84 to f406903 Compare April 2, 2024 12:02
@jakobmoellerdev
Copy link
Contributor Author

@llamerada-jp @satoru-takeuchi

Have now the following limits:

  • XFS: 300Mi
  • ext4: 32Mi
  • block/raw: 8Mi

@satoru-takeuchi
Copy link
Member

@jakobmoellerdev It's reasonable. I'll finish to review soon.

@satoru-takeuchi
Copy link
Member

@jakobmoellerdev Please update PR title to reflect the latest changes. It no longer uses ConfigMap.

@jakobmoellerdev jakobmoellerdev changed the title feat: Introduce minimum PVC allocation sizing by creating an allocation configuration + ConfigMap for topolvm-controller feat: Introduce minimum PVC allocation sizing by creating minimum allocation settings Apr 3, 2024
Signed-off-by: Jakob Möller <jmoller@redhat.com>
@llamerada-jp llamerada-jp merged commit b9e881a into topolvm:main Apr 4, 2024
24 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

None yet

4 participants