
mds: create EC pool as secondary pool #8452

Merged
merged 1 commit into rook:master from ec-file-layout on Dec 7, 2021

Conversation

subhamkrai
Contributor

@subhamkrai subhamkrai commented Aug 2, 2021

When creating an EC filesystem, create the replicated pool as the primary
pool and the EC pool as the secondary pool. Creating an EC pool
as the primary pool is not encouraged and will lead to failure.

Closes: #8210
Signed-off-by: subhamkrai srai@redhat.com

Description of your changes:

Which issue is resolved by this Pull Request:
Resolves #8210

Checklist:

  • Commit Message Formatting: Commit titles and messages follow guidelines in the developer guide.
  • Skip Tests for Docs: Add the flag for skipping the build if this is only a documentation change. See here for the flag.
  • Skip Unrelated Tests: Add a flag to run tests for a specific storage provider. See test options.
  • Reviewed the developer guide on Submitting a Pull Request
  • Documentation has been updated, if necessary.
  • Unit tests have been added, if necessary.
  • Integration tests have been added, if necessary.
  • Pending release notes updated with breaking and/or notable changes, if necessary.
  • Upgrade from previous release is tested and upgrade user guide is updated, if necessary.
  • Code generation (make codegen) has been run to update object specifications, if necessary.

@subhamkrai subhamkrai requested a review from travisn August 2, 2021 05:48
@mergify mergify bot added the ceph (main ceph tag) label Aug 2, 2021
Documentation/ceph-filesystem-crd.md: outdated review comment (resolved)
@subhamkrai subhamkrai marked this pull request as ready for review August 3, 2021 03:03
Documentation/ceph-filesystem-crd.md: 3 outdated review comments (resolved)
@leseb
Member

leseb commented Sep 9, 2021

@subhamkrai what's the status of this?

@github-actions

github-actions bot commented Oct 9, 2021

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in two weeks if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale (labeled by the stale bot) label Oct 9, 2021
@travisn travisn removed the stale (labeled by the stale bot) label Oct 11, 2021
@subhamkrai
Contributor Author

> @subhamkrai what's the status of this?

The problem is that the solution isn't working (or I'm missing something). I will look into it this week.

Comment on lines 102 to 104
```console
$ mkdir /mnt/cephfs/myssddir
$ setfattr -n ceph.dir.layout.pool -v <ec-pool-name> /mnt/cephfs/myssddir  # example ec-pool-name: myfs-ec
```
@BlaineEXE BlaineEXE (Member) commented Oct 19, 2021

Why are we making users go through all these manual hoops to get a working file store?

IMO it would be better to treat the first pool given as the default pool and configure this automatically if the user supplies multiple pools. If the user supplies a default pool that is EC, we can rely on Ceph to fail for us.

Also, a user in #8210 (comment) reported that they followed the steps here with no success.

When Rook loads the pool list, does it keep the same order from k8s? If not, that could be a source of failure.

@subhamkrai
Contributor Author

@batrick can you review this PR?

@batrick batrick (Contributor) left a comment

otherwise lgtm

Documentation/ceph-filesystem-crd.md: outdated review comment (resolved)
@subhamkrai subhamkrai force-pushed the ec-file-layout branch 2 times, most recently from 13a643f to f8f5095 Compare November 8, 2021 09:04
@subhamkrai
Contributor Author

subhamkrai commented Nov 8, 2021

Steps I did to create ec file-system

  1. kubectl create -f common.yaml -f crds.yaml
  2. kubectl create -f operator.yaml
  3. kubectl create -f cluster.yaml
  4. kubectl create -f filesystem-ec.yaml
  5. kubectl create -f csi/cephfs/storageclass-ec.yaml
  6. kubectl create -f csi/cephfs/pvc.yaml
  7. kubectl create -f csi/cephfs/pod.yaml
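
As a quick sanity check after the steps above, one might verify the filesystem and its pools from the toolbox. This is a hedged sketch: the rook-ceph namespace, the rook-ceph-tools toolbox deployment, and the filesystem name myfs-ec are assumptions matching the example manifests, and the toolbox must have been deployed separately.

```console
# List the CephFilesystem resources created by Rook.
$ kubectl -n rook-ceph get cephfilesystem
# From the toolbox, show the filesystem's MDS status and its metadata/data pools.
$ kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph fs status myfs-ec
```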

Test result:
(Notice changes in data.usage)

  1. `ceph status` without any data:

```console
sh-4.4$ ceph status
  cluster:
    id:     408d62f7-130d-4522-b531-021c79fc14e5
    health: HEALTH_WARN
            clock skew detected on mon.b
 
  services:
    mon: 3 daemons, quorum a,b,c (age 9m)
    mgr: a(active, since 10m)
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 3 up (since 9m), 3 in (since 9m)
 
  data:
    volumes: 1/1 healthy
    pools:   6 pools, 161 pgs
    objects: 36 objects, 3.2 MiB
    usage:   35 MiB used, 59 GiB / 59 GiB avail
    pgs:     161 active+clean
 
  io:
    client:   1.2 KiB/s rd, 2 op/s rd, 0 op/s wr
```

  2. `ceph status` after I added one file with a single line:

```console
sh-4.4$ ceph status
  cluster:
    id:     408d62f7-130d-4522-b531-021c79fc14e5
    health: HEALTH_WARN
            clock skew detected on mon.b
 
  services:
    mon: 3 daemons, quorum a,b,c (age 11m)
    mgr: a(active, since 11m)
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 3 up (since 11m), 3 in (since 11m)
 
  data:
    volumes: 1/1 healthy
    pools:   6 pools, 161 pgs
    objects: 36 objects, 3.2 MiB
    usage:   36 MiB used, 59 GiB / 59 GiB avail
    pgs:     161 active+clean
 
  io:
    client:   1.5 KiB/s rd, 1 op/s rd, 0 op/s wr
```

  3. `ceph status` after I added one more file (so there are now multiple files):

```console
sh-4.4$ ceph status
  cluster:
    id:     408d62f7-130d-4522-b531-021c79fc14e5
    health: HEALTH_WARN
            clock skew detected on mon.b
 
  services:
    mon: 3 daemons, quorum a,b,c (age 12m)
    mgr: a(active, since 12m)
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 3 up (since 12m), 3 in (since 12m)
 
  data:
    volumes: 1/1 healthy
    pools:   6 pools, 161 pgs
    objects: 36 objects, 3.8 MiB
    usage:   38 MiB used, 59 GiB / 59 GiB avail
    pgs:     161 active+clean
 
  io:
    client:   853 B/s rd, 1 op/s rd, 0 op/s wr
sh-4.4$ ceph osd erasure-code-profile get default
k=2
m=2
plugin=jerasure
technique=reed_sol_van
```

Application pod output:

```console
# pwd
/var/lib/www/html
# ls
lost+found  test  test1
#
```

@travisn let me know if I missed anything to verify it's working.

@subhamkrai subhamkrai force-pushed the ec-file-layout branch 2 times, most recently from 18e87b5 to 58e3468 Compare November 8, 2021 09:16
@subhamkrai subhamkrai changed the title ceph: add a directory layout for secondary ec data pool mds: add a directory layout for secondary ec data pool Nov 8, 2021
@travisn
Member

travisn commented Nov 8, 2021

From the overall ceph status I can't tell if the data is in the EC pool. What about a command such as ceph osd df or ceph osd df detail? Can you see the number of objects in each OSD growing? To confirm, I would expect 4 new objects to show up in the OSDs for every file you're creating, although I'm not sure if ceph osd df will give you enough detail to see that.

@subhamkrai
Contributor Author

On the toolbox pod:

```console
sh-4.4$ ceph df detail
--- RAW STORAGE ---
CLASS     SIZE    AVAIL    USED  RAW USED  %RAW USED
hdd    117 GiB  117 GiB  93 MiB    93 MiB       0.08
TOTAL  117 GiB  117 GiB  93 MiB    93 MiB       0.08
 
--- POOLS ---
POOL                   ID  PGS  STORED  (DATA)   (OMAP)  OBJECTS     USED   (DATA)   (OMAP)  %USED  MAX AVAIL  QUOTA OBJECTS  QUOTA BYTES  DIRTY  USED COMPR  UNDER COMPR
device_health_metrics   1   64     0 B     0 B      0 B        0      0 B      0 B      0 B      0     37 GiB            N/A          N/A    N/A         0 B          0 B
myfs-ec-metadata        2  256  76 KiB  73 KiB  3.2 KiB       25  322 KiB  312 KiB  9.7 KiB      0     37 GiB            N/A          N/A    N/A         0 B          0 B
myfs-ec-data0           3   32   158 B   158 B      0 B        2   12 KiB   12 KiB      0 B      0     37 GiB            N/A          N/A    N/A         0 B          0 B
myfs-ec-data1           4   32     0 B     0 B      0 B        0      0 B      0 B      0 B      0     74 GiB            N/A          N/A    N/A         0 B          0 B
sh-4.4$
```

On the application pod:

```console
ls -alh
total 4.5K
drwxrwxrwx 2 root root    1 Nov  9 09:09 .
drwxr-xr-x 3 root root 4.0K Nov  9 09:02 ..
-rw-r--r-- 1 root root  327 Nov  9 09:07 file1
```

I manually created a file named file1 for testing (it had a very small amount of text data), and we can see data in myfs-ec-data0.
So, we can verify it is working as expected. Thanks @leseb

@subhamkrai subhamkrai changed the title mds: add a directory layout for secondary ec data pool mds: create EC pool as secondary pool Nov 9, 2021
@subhamkrai
Contributor Author

@travisn should this also be updated in the default filesystem-ec.yaml accordingly, maybe added as a comment?

@subhamkrai
Contributor Author

subhamkrai commented Nov 24, 2021

I tested with the current changes and it looks like it is working now. We can see data being filled in myfs-ec-data1 too.
file-ec

> So there are 3 pools:
>
>   • myfs-ec-metadata
>   • myfs-ec-data0
>   • myfs-ec-data1
>
> Which of data0 or data1 is the replicated pool? Shouldn't it be called rep-data something instead? It's confusing.

I agree it is confusing.

According to my understanding

> Ok, can we not name the replicated pool ec, just to not confuse users?

I would like to do that in a follow-up PR because this PR has been open for some time and I would like to get it merged. I'll create an issue to track it.

@leseb
Member

leseb commented Nov 24, 2021

> I tested with the current changes and it looks like it is working now. We can see data being filled in myfs-ec-data1 too.
> file-ec
>
> So there are 3 pools:
>
>   • myfs-ec-metadata
>   • myfs-ec-data0
>   • myfs-ec-data1
>
> Which of data0 or data1 is the replicated pool? Shouldn't it be called rep-data something instead? It's confusing.
>
> I agree it is confusing.
>
> According to my understanding

LGTM!

@travisn travisn (Member) left a comment

Great to hear the testing is working now!

@@ -14,7 +14,7 @@ parameters:

# Ceph pool into which the volume shall be created
# Required for provisionVolume: "true"
Member

How about adding a comment here also about the required replicated data pool?
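
For illustration, a sketch of what such a comment could look like in storageclass-ec.yaml. This is an assumption for discussion: the pool name myfs-ec-data1 follows the pool naming used elsewhere in this PR and is not necessarily the committed content.

```yaml
# Ceph pool into which the volume shall be created
# Required for provisionVolume: "true"
# NOTE: for an erasure-coded filesystem the first (default) data pool must be
# replicated; point this parameter at the secondary EC data pool for volume data.
pool: myfs-ec-data1
```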

@@ -17,6 +17,8 @@ spec:
size: 3
# The list of data pool specs
dataPools:
- replicated:
Member

And let's add the comment here too. It's better to document it in multiple places so it's more discoverable.
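
A sketch of how the commented example in filesystem-ec.yaml might read; the chunk counts and replica sizes are illustrative assumptions, not necessarily the committed values.

```yaml
spec:
  metadataPool:
    replicated:
      size: 3
  # The list of data pool specs
  dataPools:
    # The first pool is the default (primary) data pool and must be replicated;
    # CephFS does not allow an erasure-coded pool as the default data pool.
    - replicated:
        size: 3
    # The erasure-coded pool is added as a secondary data pool.
    - erasureCoded:
        dataChunks: 2
        codingChunks: 1
```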

@@ -94,8 +96,7 @@ spec:
activeStandby: true
```

(These definitions can also be found in the [`filesystem-ec.yaml`](https://github.com/rook/rook/blob/{{ branchName }}/cluster/examples/kubernetes/ceph/filesystem-ec.yaml) file.
Also see an example in the [`storageclass-ec.yaml`](https://github.com/rook/rook/blob/{{ branchName }}/cluster/examples/kubernetes/ceph/csi/cephfs/storageclass-ec.yaml) for how to configure the volume.)
**IMPORTANT**: For erasure coded pools, we have to create a replicated pool as the default data pool and an erasure-coded pool as a secondary pool.
Member

Could we get a confirmation from the CephFS team about the setfattr that was suggested before? Is it not needed?

Contributor

AFAIU, if you want a CephFS directory tree's data to be stored in a specific data pool (an EC pool in this scenario), then you'd need to set the ceph.dir.layout.pool attribute as described here: https://docs.ceph.com/en/latest/cephfs/file-layouts/#adding-a-data-pool-to-the-file-system

@batrick @kotreshhr ^^
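
For reference, a minimal sketch of the file-layout approach described in the linked Ceph documentation; the mount path and pool name here are example assumptions.

```console
# Pin a directory to the EC data pool; new files created under it inherit the layout.
$ setfattr -n ceph.dir.layout.pool -v myfs-ec-data1 /mnt/cephfs/ec-dir
# Verify the layout attribute that was set.
$ getfattr -n ceph.dir.layout.pool /mnt/cephfs/ec-dir
```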

Contributor Author

This is what we wanted to confirm: while testing, I didn't run any file-layout command, and we can see here that data is in the erasure-coded pool.

@ajarr
Contributor

ajarr commented Nov 26, 2021

@subhamkrai, later in the doc 'Documentation/ceph-filesystem-crd.md' I see,

dataPools: The settings to create the filesystem data pools. If multiple pools are specified, Rook will add the pools to the filesystem. Assigning users or files to a pool is left as an exercise for the reader with the CephFS documentation. The data pools can use replication or erasure coding. If erasure coding pools are specified, the cluster must be running with bluestore enabled on the OSDs.

In the block above, you may want to add a sentence explicitly saying that the erasure coding pool should not be used as the default primary data pool.

FYI, I've requested a clarification here, #8210 (comment)

@subhamkrai
Contributor Author

> @subhamkrai, later in the doc 'Documentation/ceph-filesystem-crd.md' I see,
>
> dataPools: The settings to create the filesystem data pools. If multiple pools are specified, Rook will add the pools to the filesystem. Assigning users or files to a pool is left as an exercise for the reader with the CephFS documentation. The data pools can use replication or erasure coding. If erasure coding pools are specified, the cluster must be running with bluestore enabled on the OSDs.
>
> In the block above, you may want to add a sentence explicitly saying that the erasure coding pool should not be used as the default primary data pool.
>
> FYI, I've requested a clarification here, #8210 (comment)

Thanks for the review, I'll update the block.

@mergify

mergify bot commented Nov 30, 2021

This pull request has merge conflicts that must be resolved before it can be merged. @subhamkrai please rebase it. https://rook.io/docs/rook/latest/development-flow.html#updating-your-fork

@subhamkrai
Contributor Author

@Omar007 with the current change, in my testing the EC pool is working with CephFS. Can you confirm whether it is working for you or not? Thanks

@Omar007
Contributor

Omar007 commented Dec 1, 2021

If we're talking about the original problem and being able to set it up as documented with only EC pools? No, not at all. This is still just a work-around relying on a replica pool. The core problem is not addressed so that is still not going to work.
There was nothing stopping anyone from doing a replica-based filesystem setup and using dir layouts. That was never the problem. The core problem is just being ignored/worked around by creating a useless dummy pool. If it shouldn't need to be there in the first place and isn't ever used at all, it shouldn't exist.

You should be able to use just EC pools: either a single pool of the EC type or even any number of them; the dir-layouts setup should also be possible with all EC pools (e.g. a 6+2 and a 2+1, or even more variations), without the use of a (replica) dummy pool. Neither is currently possible, and this MR changes nothing to fix that, nor does it introduce anything new that wasn't already working.
This is of course assuming the documentation is correct and the core Ceph functionality is implemented correctly/as documented, for which I have not heard anything contrary to that yet in #8210 or anywhere else, nor have I encountered any (undocumented or breaking) complications myself.

As such, at this point there is no reason for me to put any more time into this subject until either the supposed-to-be-working setup is:

  • actually working by default (both in Ceph and Rook)
  • documented (in Ceph) to require the use of --force (this is what currently seems to be the only deviation between the documentation and implementation in Ceph; it currently never mentions this setup requires --force until you try to set it up as documented and run into it on the CLI. Probably because it does have a caveat and they want the end-user to be explicit about accepting that but it does work) and made usable in Rook.
  • officially declared dead/broken/not supposed to work in Ceph, the documentation no longer provides it as a valid option and the problems on why it's broken/can't be used are documented.

With the information I've currently seen, I'm inclined to say the only valid 'final' solution right now would be the first or second bullet.
For the first that basically means at the Ceph level it shouldn't require the use of --force for this use-case and Rook would already work.
For the second, Rook would have to either detect the use case and set the --force flag automatically as needed, or expose a property (probably on the CRD) to allow the end-user to request the use of --force during filesystem creation (which I mentioned before here). The former would obviously be more user-friendly, but with the latter at least you require the user to acknowledge the possible caveat(s) and explicitly accept them by setting a property.
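
For context, a rough sketch of the manual Ceph-level commands the --force discussion above refers to; the pool and filesystem names are example assumptions.

```console
# CephFS refuses an erasure-coded pool as the default data pool unless --force is
# given, and the EC pool must allow overwrites to be usable by CephFS at all.
$ ceph osd pool set my-ec-data allow_ec_overwrites true
$ ceph fs new my-ec-fs my-ec-metadata my-ec-data --force
```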

@manjo-git

I was able to get filesystem-ec working with the patches mentioned in https://github.com/rook/rook/pull/8452/files. Please see #9244 (comment) for details.

@BlaineEXE
Member

This is going to interfere a bit with #9296 as well.

@travisn travisn (Member) left a comment

I recommend we go ahead and get this merged. The instructions do enable a working EC filesystem as has been confirmed from multiple sources. Defining the extra replica pool is currently a requirement from cephfs, so will need separate follow-up on the possibility of removing that requirement. For now, let's at least merge this to fix the docs and example.
@subhamkrai Please do resolve the merge conflicts and review my final comments from last week.

@degorenko
Contributor

Would you mind adding some validation on CephFS creation that the first element of dataPools is not erasureCoded? This would be helpful for runtime debugging without needing to inspect the docs.

@Omar007
Contributor

Omar007 commented Dec 6, 2021

Could we not? Or at least only log it, because blocking it completely would prevent manually intervening to enable a should-be-working use case. At least now I can still implement it as documented by manually intervening, but if the operator puts a restriction here, even that is no longer possible :/

@travisn
Member

travisn commented Dec 6, 2021

> Could we not? Or at least only log it, because blocking it completely would prevent manually intervening to enable a should-be-working use case. At least now I can still implement it as documented by manually intervening, but if the operator puts a restriction here, even that is no longer possible :/

Logging a warning sounds fine so you can keep the manual workaround.

When creating an EC filesystem, create the replicated pool as the primary
pool and the EC pool as the secondary pool. Creating an EC pool
as the primary pool is not encouraged and will lead to failure.

Also, change the pool name in the storageclass-ec file.

Closes: rook#8210
Signed-off-by: subhamkrai <srai@redhat.com>
@subhamkrai
Contributor Author

> I recommend we go ahead and get this merged. The instructions do enable a working EC filesystem as has been confirmed from multiple sources. Defining the extra replica pool is currently a requirement from cephfs, so will need separate follow-up on the possibility of removing that requirement. For now, let's at least merge this to fix the docs and example. @subhamkrai Please do resolve the merge conflicts and review my final comments from last week.

rebased!

@BlaineEXE BlaineEXE (Member) left a comment

Seems fine with the note that we will have to do a little work to consider in
#9296 how to verify the same thing. And maybe longer term we can find a more elegant solution.

@travisn
Member

travisn commented Dec 7, 2021

> Seems fine with the note that we will have to do a little work to consider in #9296 how to verify the same thing. And maybe longer term we can find a more elegant solution.

With the latest discussion in #9296 I don't see that the EC filesystem would be impacted, other than that you could choose the pool names. The ordering of the pools in the CR would still be used.

@travisn travisn merged commit 3d69e10 into rook:master Dec 7, 2021
leseb added a commit that referenced this pull request Dec 8, 2021
mds: create EC pool as secondary pool (backport #8452)
Labels: ceph (main ceph tag), filesystem
Projects: none yet
Development

Successfully merging this pull request may close these issues.

EC filesystem fails to deploy with latest release of Rook-Ceph
9 participants