CephFS volumes handling via PVCs (dynamic provisioning) #1125

Closed
jpds opened this issue Oct 23, 2017 · 26 comments
@jpds

jpds commented Oct 23, 2017

Per #1115, it'd be interesting if CephFS volumes could be handled by storage classes and PVCs. I think this would make it a lot easier to deploy these volumes via Helm charts, for instance, where people would just need to set the storageClass to something like rook-filesystem and the accessModes to ReadWriteMany, rather than hand-writing YAML that is specific to Rook itself.

@kokhang
Member

kokhang commented Oct 25, 2017

I think this is something we should do, and it aligns well with our block scenario. But one thing about CephFS is that it allows you to provide a "path" so that you can mount a path within the CephFS on a pod.

In Kubernetes, mount options are only given during provisioning, so I am not sure how we could pass the path option at mount time. Ideally, the spec would look something like:

volumes:
- name: mysql-persistent-storage
  persistentVolumeClaim:
    claimName: mysql-pv-claim
  mountOptions: path

But I don't think something like this is supported in K8s.

@galexrt @travisn what is your take on this?

@galexrt
Member

galexrt commented Oct 26, 2017

@kokhang From what I understand, the path is essential for applications that share more than one directory.

Example: an application with dataABC/ and dataXYZ/ directories would require "mounting" both from the CephFS using the path option in two volumeMounts.
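
For reference, stock Kubernetes can already select a sub-directory of a volume on the container side via the subPath field of volumeMounts (distinct from the hypothetical mountOptions: path above). A minimal sketch, assuming a shared CephFS-backed PVC named shared-cephfs-claim already exists; the pod and claim names here are illustrative:

apiVersion: v1
kind: Pod
metadata:
  name: data-app
spec:
  containers:
  - name: app
    image: busybox
    command: ["sleep", "infinity"]
    volumeMounts:
    # Each mount picks a different sub-directory of the same shared volume
    - name: shared-fs
      mountPath: /data/dataABC
      subPath: dataABC
    - name: shared-fs
      mountPath: /data/dataXYZ
      subPath: dataXYZ
  volumes:
  - name: shared-fs
    persistentVolumeClaim:
      claimName: shared-cephfs-claim

subPath is resolved relative to the root of the mounted volume, so it covers the per-directory case at mount time, though not choosing a path at provisioning time.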

Also, having one CephFS "per claim" would blow up the Ceph PG count (from what I know about Ceph, there are warnings for both "not enough PGs on a node" and "too many PGs on a node").

@kokhang
Member

kokhang commented Oct 26, 2017

What if we had just one CephFS and shared it? Every claim would then just use a path: pvc-cephfs-1/ for one PVC and pvc-cephfs-2/ for the other. But the issue with sharing a filesystem between users is that there is no separation (other than the mount path).

@jpds
Author

jpds commented Oct 26, 2017

Is it possible to have one CephFS (the same way that we have one 'rook-block', i.e. 'rook-filesystem') that has multiple PVCs inside of it, which are in turn mounted by a deployment?

The deployment can then specify a path in its volumeMount for the volume.

@jpds
Author

jpds commented Oct 26, 2017

No, it looks like it's one CephFS per PV, but I feel that the PG issue would happen with or without Rook.

@kokhang
Member

kokhang commented Oct 28, 2017

Submitted a design proposal to #1152. Please chime in there.

@kokhang kokhang added this to the 0.7 milestone Oct 30, 2017
@kokhang kokhang added this to Discuss in v0.6 Oct 30, 2017
@DanKerns DanKerns removed this from Discuss in v0.6 Nov 1, 2017
@DanKerns DanKerns added this to To Do in v0.7 via automation Nov 1, 2017
@jbw976 jbw976 removed this from To Do in v0.7 Feb 12, 2018
@jbw976 jbw976 modified the milestones: 0.7.5, 0.9 Feb 28, 2018
@travisn travisn added this to To do in v0.9 via automation Aug 2, 2018
@jbw976 jbw976 changed the title CephFS volumes handling via PVCs CephFS volumes handling via PVCs (dynamic provisioning) Aug 3, 2018
@jbw976 jbw976 added the ceph main ceph tag label Aug 3, 2018
@galexrt galexrt self-assigned this Sep 13, 2018
@kaoet

kaoet commented Sep 14, 2018

We also want this feature.

I think the CephVolumeClient in the official Ceph client is a good reference.

@drake7707

Would also like to see this. We're running Kubernetes inside a Docker container (using a docker-in-docker solution) for clean setup/teardown and rapid on-premises deployment. This causes issues with mounting the RBDs to the pods, as the libceph kernel module expects the context to be the initial network namespace, and --net=host is not an option.

@travisn travisn removed this from the 0.9 milestone Nov 20, 2018
@travisn travisn removed this from To do in v0.9 Nov 20, 2018
@whereisaaron
Contributor

We are using this external CephFS provisioner, which is ideal from our point of view. And something like this included in Rook would be great.

https://github.com/kubernetes-incubator/external-storage/tree/master/ceph/cephfs

In response to a PVC for the provisioner's StorageClass, the provisioner creates a Ceph user and PV which maps to a directory (named after the PV ID/name) in a shared, underlying CephFS. This existing implementation matches what @kokhang describes above, except per-PVC Ceph users are created for extra separation over just the mount path.
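
For anyone wanting to try it, here is a rough sketch of the StorageClass that provisioner consumes, loosely following the example in that repository's README; the monitor address, secret names, and claimRoot below are placeholder values to adapt to your cluster:

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: cephfs
provisioner: ceph.com/cephfs
parameters:
  # Ceph monitor endpoint(s) - placeholder
  monitors: 172.24.0.6:6789
  # Admin credentials the provisioner uses to create per-PVC Ceph users
  adminId: admin
  adminSecretName: ceph-secret-admin
  adminSecretNamespace: cephfs
  # Parent directory in the shared CephFS under which per-PVC directories are created
  claimRoot: /pvc-volumes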

These PVCs are ReadWriteMany mountable so can be shared between Pods. And deleting the PVC causes the provisioner to delete (or optionally retain) the folder in CephFS. Pods only get their PVC's directory mounted, so they can't see or access other directories. Very clean, very self-managing.

So when you have 100's of small PVC requirements, they become small directories on a decent-sized shared CephFS, rather than 100's of separate filesystems or block devices to manage.

You can horizontally scale the provisioner. If you have multiple Ceph clusters or CephFS instances, you can deploy the provisioner multiple times (with a different election ID and StorageClass).

This cephfs-provisioner follows the same approach as the other external shared filesystem provisioners we use, like efs-provisioner and nfs-client-provisioner, so all our deployments are totally agnostic as to which underlying system is used. We can even assign role-based StorageClass names in clusters, like app-config or website-content-files, and use or mix different provisioners in different clusters or cloud providers.

Before we had these external provisioners, we used to directly mount directories (NFS, FlexVolume) in shared filesystems. We had to manage the directory structure and clean up the directories ourselves. It was doable, but a lot more hassle and more error-prone.

@mykaul
Contributor

mykaul commented Jan 6, 2019

I believe we'd prefer to move to CSI support (https://github.com/ceph/ceph-csi) rather than the external provisioner?

@whereisaaron
Contributor

@mykaul sure, that's an implementation decision for the project to make. CSI is certainly a forward-looking option. As a user, what I am looking for is the provisioning experience I describe above: in particular, the efficient management of large numbers of small PVC requirements in a storage-subsystem-agnostic manner.

@huizengaJoe

FWIW, I was just doing some research on possible shared filesystem options. I knew that Rook supported dynamic provisioning for Ceph block storage and assumed it had the same support for filesystems, so I was surprised that it was not available.

@travisn travisn added this to To do in v1.0 via automation Jan 16, 2019
@bordeo

bordeo commented Mar 4, 2019

I'm really interested in this feature because many Helm charts (e.g. Jenkins) ask for an existing persistent volume claim name and not for a FlexVolume.

@mykaul
Contributor

mykaul commented Mar 4, 2019

@bordeo - just out of curiosity, why would you use CephFS and not RBD for Jenkins jobs?

@bordeo

bordeo commented Mar 4, 2019

Because we need to share the workspace across job pods that can run on multiple nodes. RBD is ReadWriteOnce, or am I wrong?

@mykaul
Contributor

mykaul commented Mar 4, 2019

@bordeo - no, you are not wrong; that's a good use case. I was just wondering because Jenkins jobs often do not share a workspace, and in that case performance is quite likely to be better with their own RBD-backed XFS filesystem.

@anguslees
Contributor

anguslees commented Mar 5, 2019

So just to clear something up: rook/cephfs works just fine via PVCs, and has done so for approximately ever.

Here's a rook/cephfs PVC I use daily (from parallel workers in a complex jenkins job as it happens):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: oe-scratch
  namespace: jenkins
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 200Gi  # required by the PVC API; must fit within the matching PV's capacity
  selector:
    matchLabels:
      name: oe-scratch
  storageClassName: rook-cephfs

The PVC can also be created from within a StatefulSet volumeClaimTemplate or anywhere else that PVCs show up in the k8s API.
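
As an illustration of the volumeClaimTemplates form, here is a minimal sketch; the StatefulSet name, image, and size are made up, and note that without dynamic provisioning each generated PVC still needs a matching, statically created PV:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: oe-worker
  namespace: jenkins
spec:
  serviceName: oe-worker
  replicas: 2
  selector:
    matchLabels:
      app: oe-worker
  template:
    metadata:
      labels:
        app: oe-worker
    spec:
      containers:
      - name: worker
        image: busybox
        command: ["sleep", "infinity"]
        volumeMounts:
        - name: scratch
          mountPath: /scratch
  volumeClaimTemplates:
  # One PVC per replica is generated from this template (scratch-oe-worker-0, ...)
  - metadata:
      name: scratch
    spec:
      accessModes:
      - ReadWriteMany
      storageClassName: rook-cephfs
      resources:
        requests:
          storage: 200Gi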

What doesn't work (right now) with rook/cephfs is dynamic provisioning, i.e. Kubernetes can't take the above and magically create the underlying PV from a storageClass "factory" description. So to use the above PVC, the admin needs to create a CephFS PV manually (aka "static provisioning"). The one I created for the above is:

apiVersion: v1
kind: PersistentVolume
metadata:
  labels:
    name: oe-scratch
  name: oe-scratch
spec:
  accessModes:
  - ReadWriteMany
  capacity:
    storage: 200Gi
  flexVolume:
    driver: ceph.rook.io/rook
    fsType: ceph
    options:
      clusterNamespace: rook-ceph
      fsName: ceph-filesystem
  persistentVolumeReclaimPolicy: Recycle
  storageClassName: rook-cephfs

Note the storageClassName, accessModes, and selector labels need to match. This is unfortunately a little bit more than "just-works", but honestly for a read-write-many persistent volume that continuously gets re-used, it's not that much work. Importantly, it allows me to consume unmodified upstream manifests (via helm charts or whatever) since all they have to know is the appropriate PVC accessMode/selector/storageClassName values - there is no mention of flexVolume or other specifics at the PVC level.

So: rook/cephfs supports PVCs just fine, since PVCs are a volume-type-agnostic feature within kubernetes. What rook/cephfs does not (yet) support is "dynamic provisioning" of PVs, and new rook/cephfs PVs need to be explicitly created.

See whereisaaron's comment above for a "3rd party" dynamic provisioning solution that allows carving up a single (statically provisioned) cephfs PV into multiple smaller PVs (using subdirectories). Honestly, that's about as good as it could be within this space unless you need truly separated cephfs volumes (perhaps to support different underlying cephfs options).

See https://kubernetes.io/docs/concepts/storage/persistent-volumes/ for a more in-depth discussion.

@travisn travisn removed this from the 1.0 milestone Mar 15, 2019
@travisn travisn removed this from To do in v1.0 Mar 15, 2019
@konvergence

konvergence commented Apr 4, 2019

Hi All

I'm actually using the project https://github.com/kubernetes-incubator/external-storage/tree/master/ceph/cephfs
to create a StorageClass on a CephFS managed by Rook v0.9.2.

Currently, each PVC created by Kubernetes is a folder on the CephFS, but there is no option to enable quotas on the CephFS folders to limit usage to the size defined on the PVC.

Quotas are a feature already available in Ceph.

So do you plan to implement limiting PVC usage on CephFS with quotas?

I know that you are waiting for CSI plugin integration to allow resizing of RBD.

@martin31821

@konvergence can you share the manifests to build a setup like yours? That would be very helpful since we're running into the same issue.
Also, some performance numbers for block device vs. CephFS would be nice.

@thomasjm

One interesting note about the external provisioner: I noticed that the PersistentVolume created by cephfs-provisioner is actually a CephFS persistent volume, which Kubernetes evidently has native support for (see here).

On the other hand, the manually created PV given by @anguslees above uses FlexVolume.

They both seem to work okay; I'm not sure which one is preferred though.

@MikeDaniel18

What's the progress of this feature? It's been nearly two years since it was picked up, and though there seemed to be initial momentum, it doesn't appear that development has continued on it. Can we track the progress somewhere?

@thomasjm

FWIW, there's an open PR to more or less bring cephfs-provisioner into Rook: #3264

There's some discussion over there about whether to finish that PR or wait for the CSI driver, which would have dynamic provisioning. From that thread:

The goal is to use the Ceph CSI as default and also automatically set it up through the Rook Ceph operator in v1.1.

I'm hoping that this solution works the same way as cephfs-provisioner, making each PV a subfolder of a Ceph filesystem. I haven't tried it yet because I was scared off by #3315 mentioned above.

I'd love to see this added to the roadmap (in a way that specifically includes dynamic provisioning). For example, #2650 is a 1.1 roadmap item but it doesn't seem to include the actual provisioner. @travisn do you happen to know the status of this? Thanks!

@MikeDaniel18

@thomasjm Thank you for digging that one up. That does seem to imply things are making progress one way or another.

Perhaps I'll hold off on using it until 1.1 is released.

@ajarr
Contributor

ajarr commented Aug 14, 2019

Since #3562 was merged, Rook master uses Ceph CSI by default to dynamically provision PVs backed by CephFS. Documentation is here:
https://github.com/rook/rook/blob/master/Documentation/ceph-filesystem.md#provision-storage
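
For reference, the linked documentation describes a CSI-based StorageClass along these lines. This is a rough sketch only; the filesystem name, pool, and secret names below are the documentation's example values and may differ between Rook versions and clusters:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rook-cephfs
# Provisioner name format: <operator namespace>.cephfs.csi.ceph.com
provisioner: rook-ceph.cephfs.csi.ceph.com
parameters:
  # Namespace of the Rook Ceph cluster
  clusterID: rook-ceph
  # CephFilesystem name and its data pool
  fsName: myfs
  pool: myfs-data0
  # Secrets created by the Rook operator for the CSI driver
  csi.storage.k8s.io/provisioner-secret-name: rook-csi-cephfs-provisioner
  csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
  csi.storage.k8s.io/node-stage-secret-name: rook-csi-cephfs-node
  csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph
reclaimPolicy: Delete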

@travisn, should this issue remain open until Rook v1.1 is released?

@galexrt
Member

galexrt commented Sep 6, 2019

@ajarr Let's keep this issue open in light of PR #3264, as it may or may not still implement the feature for the Rook CephFS FlexVolume driver.

But yes, in general, as Ceph CSI is the direction going forward, the issue itself is "resolved" by the latest merges to master.

@galexrt
Member

galexrt commented Oct 18, 2019

Thinking about it more: since Ceph CSI provides dynamic provisioning of CephFS and is available in Rook v1.1, we can close this issue.

#3264 might bring FlexVolume support for dynamic provisioning as well in the future.

@galexrt galexrt closed this as completed Oct 18, 2019