
Very high CPU usage with v1.0.0 #3132

Open
vitobotta opened this issue May 7, 2019 · 162 comments

@vitobotta commented May 7, 2019

I am not sure where the problem is, but I have been seeing very high CPU usage since I started using v1.0.0. With three small clusters, the load average skyrockets into the tens quite quickly, making the nodes unusable. This happens while copying quite a bit of data to a volume mapped on the host, bypassing k8s (to restore data from an existing non-k8s server). Nothing else is happening with the clusters at all. I am using low-spec servers (2 cores, 8 GB of RAM), but I didn't see any of these high-load issues with 0.9.3 on same-spec servers.
Has something changed with Ceph or anything else that might explain this? I've also tried with two providers, Hetzner Cloud and UpCloud. Same issue whenever actually using a volume.

Is it just me or is it happening to others as well? Thanks!

@vitobotta vitobotta added the bug label May 7, 2019

@davidkarlsen (Contributor) commented May 7, 2019

I see the same (but with chart v0.9.3): it just freezes and the system load increases and increases.

Running iostat -x 1 shows 100% util on the RBD devices, but no actual I/O.

@vitobotta (Author) commented May 8, 2019

I have tried with a new single-node cluster from scratch, this time with 4 cores from UpCloud, which also happens to have the fastest disks (by far) I've seen among the cloud providers I have tried, so it's unlikely to be a problem with the disks. Well, exactly the same problem. :( After a little while downloading many largish files like videos, the server became totally unresponsive and I couldn't even SSH into it again. Like I said earlier, with the previous version of Rook (v0.9.3) I could do exactly the same operation (basically I am testing the migration of around 25 GB of Nextcloud data from an old pre-Kubernetes server) even with servers with just 2 cores. I am going to try again with this version...

@vitobotta (Author) commented May 8, 2019

Also, since I couldn't SSH into the servers I checked the web console from UpCloud and saw this:

[Screenshot from the UpCloud web console, 2019-05-08 13:54, showing a kernel hung-task message]

Not sure if it's helpful... I was also wondering whether there are issues using Rook v1.0 with K3S, since I've used K3S with these clusters (but also with v0.9.3, which was OK). Perhaps I should also try with standard Kubernetes just to see if there's a problem there. I'll do this now.

@bengland2 commented May 8, 2019

@vitobotta, I've seen this hung-task message when something like RBD or CephFS is unresponsive and a VM thinks that the I/O subsystem is hung. So the question then becomes: why is Ceph unresponsive? Is the Ceph cluster healthy when this happens (ceph health detail)? Can you get a dump of your Ceph parameters using the admin socket, something like "ceph daemon osd.5 config show"? Does K8S show any Ceph pods in a bad state?

You may want to pay attention to memory utilization by the OSDs. What is the CGroup memory limit for the rook.io OSD pods, and what is the ceph.conf-defined osd_memory_target set to? The default for osd_memory_target is 4 GiB, much higher than the default for the OSD pod "resources": "limits". This can cause OSDs to exceed the CGroup limit. Can you do a "kubectl describe nodes" and look at what the memory limits for the different Ceph pods actually are? You may want to raise the limits in cluster.yaml and/or lower osd_memory_target. Let me know if this helps. See this article on osd_memory_target.
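For reference, a minimal sketch of what raising the OSD memory limits in cluster.yaml could look like. The resources.osd fields come from the Rook CephCluster spec; the 4Gi values are illustrative assumptions meant to keep the CGroup limit at or above the default osd_memory_target, not a recommendation from this thread:

# Hypothetical excerpt of a CephCluster cluster.yaml; values are examples only
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  # ... other cluster settings ...
  resources:
    osd:
      requests:
        cpu: "500m"
        memory: "4Gi"
      limits:
        memory: "4Gi"   # keep this at or above osd_memory_target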

@vitobotta (Author) commented May 8, 2019

Hi @bengland2, yes, the clusters (I have tried with several) were healthy etc. when I was doing these tests.

In the meantime I have recreated the whole thing again, but this time with OpenEBS instead of Rook just to test, and while OpenEBS was slower I didn't have any issues at all, with the load never going above 4.

With Rook, the same test on the same specs reached 40 or even more until I had to forcefully reboot, and this happened more than once. I am going to try once again with OpenEBS to see if I was just lucky...

@bengland2 commented May 8, 2019

@vitobotta Sounds like you are copying files to an RBD volume. Try lowering your kernel dirty pages way down (e.g. sysctl -w vm.dirty_ratio=3 vm.dirty_background_ratio=1) on your RBD client and see if that makes write response times more reasonable. Also, maybe you need to give your OSDs more RAM; in Rook this is done with the resources: parameter. A Bluestore OSD expects to have > 4 GiB of RAM by default. Older rook.io may not be doing this by default. Ask me if you need more details.
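A small sketch of applying and persisting those sysctls on the RBD client host (the values are the ones suggested above, and the file name under /etc/sysctl.d is an arbitrary choice):

# Apply immediately on the host where /dev/rbdX is mapped
sudo sysctl -w vm.dirty_ratio=3
sudo sysctl -w vm.dirty_background_ratio=1

# Persist across reboots (file name is arbitrary)
cat <<'EOF' | sudo tee /etc/sysctl.d/99-rbd-writeback.conf
vm.dirty_ratio = 3
vm.dirty_background_ratio = 1
EOF
sudo sysctl --system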

@vitobotta (Author) commented May 8, 2019

The weird thing is that I didn't seem to have these issues with the previous version, using the same specs and config. I'm not sure which RBD client you mean; I just mounted the /dev/rbdX device into a directory :)

@bengland2 commented May 8, 2019

@vitobotta by "RBD client" I meant the host where you mounted /dev/rbdX. Also I expect you are using Bluestore not Filestore OSDs.

@vitobotta (Author) commented May 8, 2019

I think Filestore since I was using a directory on the main disk rather than additional disks.

@bengland2 commented May 8, 2019

Filestore is basically in maintenance mode at this point; you should be using Bluestore, which has much more predictable write latency. Let us know if Bluestore is giving you trouble.

@vitobotta (Author) commented May 11, 2019

Hi @bengland2, I hadn't read anywhere that Filestore (and thus directory support?) is no longer in active development; I must have missed it... I will try with additional disks instead of directories so I can test with Bluestore when I have time.

Today I had a chance to repeat the test with that 25 GB of mixed data on a new 3-node cluster with Rook 1.0 installed. The test started well until it got to extracting/copying videos, at which point the load average once again climbed quite quickly to over 70 on one node and 40 on another, so I had to forcefully reboot the two nodes. I uninstalled/cleaned up Rook completely and repeated the test with OpenEBS first, then Longhorn. OpenEBS was again very, very slow but worked, while Longhorn reached a load of at most 12 when processing videos, but it completed the task and I was able to move on.

Also, this time I am running standard Kubernetes 1.13.5, not K3S, so I have excluded both that it could be a problem with K3S and that it could be a problem with the provider I was using before (Hetzner Cloud).

I don't know what to say... I hoped I could use Rook because it's faster and I have heard good things, but from these tests it looks almost unusable for me when dealing with large files. At least that's the impression I have, unfortunately :(

I will try with disks instead of directories when I have a chance. Thanks

@vitobotta (Author) commented May 11, 2019

I can't believe it! :D

I decided to try Bluestore now because I want to understand what's going on, so I set up a new cluster, this time on DigitalOcean (3x 4 cores, 8 GB RAM), and added volumes to the droplets so as to use those disks with Ceph instead of a directory on the main disk. I was able to complete the usual test and the load never went above 5 when extracting videos!

I don't think it's because of DigitalOcean vs Hetzner Cloud/UpCloud; I guess the problem was, as you suggested, Filestore with directories. But out of curiosity, why is there such a big difference in performance and CPU usage between Filestore and Bluestore? Thanks! I'm gonna try the experiment once again just in case, and if it works I will be very happy! :)

@vitobotta (Author) commented May 12, 2019

Tried again and had the same problem. :(

@travisn travisn added this to the 1.0 milestone May 14, 2019

@travisn travisn added this to To do: v1.0.x patch release in v1.0 May 14, 2019

@BlaineEXE BlaineEXE added the ceph - label May 14, 2019

@BlaineEXE (Member) commented May 29, 2019

I believe this may be an issue with Ceph itself. It's my understanding that the Ceph OSDs with Bluestore can use a lot of CPU in some circumstances. I think this is especially true for clusters with many OSDs and clusters with very fast OSDs.

Bluestore will generally result in better performance compared to Filestore, but the performance also comes with more CPU overhead. It's also my understanding that in today's hardware landscape, Ceph performance is often bottlenecked by CPU.

Update: I created a Ceph issue here https://tracker.ceph.com/issues/40068

@bengland2 commented May 29, 2019

@BlaineEXE To see what they are doing about it, see project Crimson. Ceph was designed in a world of HDDs, with 3 orders of magnitude fewer random IOPS per device. So yes, it needs an overhaul, and they are doing that. Ceph is not the only application dealing with this.

@vitobotta (Author) commented May 29, 2019

An update... As suggested by @BlaineEXE, I ran my usual test using the latest Mimic image instead of Nautilus. It worked just fine with two clusters and managed to finish copying the test data with very low CPU usage. I repeated this twice with two different clusters, successfully both times. For the third test, I just updated Ceph to Nautilus on the second cluster, and surprisingly the test finished OK again. But then I created a new cluster with Nautilus from the start and boom, the usual problem: it started OK until I had to forcefully reboot the server. This is a single-node cluster (4 cores, 16 GB of RAM) with Rook 1.0.1 on Kubernetes 1.13.5 deployed with Rancher. There's a problem somewhere, I just wish I knew where.

@BlaineEXE (Member) commented May 29, 2019

Is @sp98 still working on this issue? It would be great to see if there are any noticeable differences between how Rook starts up a Mimic cluster compared to how it starts up a Nautilus cluster to determine if Rook is the cause. We should also pay close attention to the behavior of ceph-volume, as the initial OSD prep using Mimic's c-v could be different than the prep using Nautilus' c-v.

@sp98 (Contributor) commented May 30, 2019

@BlaineEXE Yes, but I had to move to 2696. I will jump back to this one in a few days' time. Thanks for those updates above. I'll try that and update my findings here.

@vitobotta (Author) commented May 31, 2019

Just tried once again with a new cluster, again with the latest version from the start: same problem. As of now I am still unable to actually use Rook/Ceph :( It's not like there are things I could be doing wrong, because it's so easy to install etc., so I don't know where to look. This time the problem occurred very quickly after I started copying data into a volume.

I was wondering, could it be something related to using a volume directly bypassing kubernetes?

Not sure if it's helpful, but what I am trying to do is download some data from an existing server into a volume so that I can use that data with Nextcloud. Because there are timeouts etc. if I try to do it from inside a pod, this is what I do to use the volume directly (a consolidated sketch follows the list):

  • Install a new instance of Nextcloud, which has one volume for the data and one for the app html.
  • Scale the deployment to zero, so that there are no pods using the volumes.
  • With the rook-toolbox, map the volume, e.g.
    rbd map <pvc> -p replicapool
    which gives me the device name, e.g. /dev/rbd3.
  • Mount the volume into a temp directory on the host:
    mkdir nextcloud-data
    mount /dev/rbd3 nextcloud-data
  • Finally, download the data from the old server into this volume:
    mkdir -p nextcloud-data/_migration
    cd nextcloud-data/_migration/
    ssh old-server "cat /data/containers/nextcloud.tar" | tar -xvv
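Put together, the steps above amount to roughly the following sketch (the namespace, deployment name and RBD image name are placeholders for whatever your Nextcloud install uses; the rbd commands run from the rook-toolbox as above):

# Placeholders: namespace/deployment "nextcloud", <pvc-image-name> for the data PVC's image
kubectl -n nextcloud scale deployment nextcloud --replicas=0   # free the volumes

rbd map <pvc-image-name> -p replicapool                        # prints e.g. /dev/rbd3

mkdir nextcloud-data
mount /dev/rbd3 nextcloud-data
mkdir -p nextcloud-data/_migration
cd nextcloud-data/_migration
ssh old-server "cat /data/containers/nextcloud.tar" | tar -xvv

# Afterwards: unmount, unmap and scale the deployment back up
cd ../..
umount nextcloud-data
rbd unmap /dev/rbd3
kubectl -n nextcloud scale deployment nextcloud --replicas=1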

It starts downloading the data, and then at some random point, sooner or later, the load climbs very quickly to 70-80 or more until I have to forcefully reboot the server. Since, as I said, I don't know where to look, I am really confused by this problem. I even thought it might have something to do with the fact that I am extracting the archive while downloading it (I know, it doesn't make sense), but the problem also occurs if I just download the archive somewhere first and then extract it into the volume.

I am new to all of this, so I wish I had more knowledge on how to investigate further :(
I don't have any of these issues when using Longhorn or OpenEBS, but at this point I would prefer Rook for the performance and because Ceph is an established solution, while the others are very new and have their own issues.

@travisn (Member) commented May 31, 2019

@vitobotta Can you confirm if you have had this issue when deploying mimic (v13) or only with nautilus (v14)? If you haven't tried mimic with rook 1.0, could you try that combination? It would be helpful to confirm if this is a nautilus issue, or if it's rook 1.0 that causes the issue and it happens on both mimic and nautilus.

@markhpc @bengland2 Any other secrets up your sleeve for tracking down perf issues besides what has been mentioned? Thanks!

@vitobotta (Author) commented May 31, 2019

Hi @travisn, I did a couple of tests with Mimic the other day and didn't have any problems with it. I just tried again with Mimic (v13.2.5-20190410) right now and all was good. Since I was always using Rook 1.0.1, it seems like it may be an issue with Nautilus? I am using Ubuntu 18.04 with 4.15.0-50-generic, if that helps somehow. Once I did a test with Fedora 29 (I think?) as suggested by @galexrt and it worked fine; I don't know if I was just lucky... Perhaps I can try again to see if it happens only with Ubuntu.

@vitobotta (Author) commented Jun 1, 2019

Hi all, I have done some more tests with interesting results. By "tests" I don't mean anything scientific since I lack deeper understanding of how this stuff works. I mean the usual download of data into a volume as described earlier.

I have repeated the same test with multiple operating systems and these are the results:

  • Ubuntu 18.04: the download/copy fails each time. It starts OK, but at some point (sometimes right away, sometimes even towards the end) it stalls and I have to forcefully reboot the server because it becomes unresponsive;

  • Fedora 29: I have tried 3 times, no problems;

  • CentOS 7: I have tried just once and had no problems;

  • RancherOS 1.5.2: I tried twice, once using the Ubuntu console and once using the Fedora console. The test failed both times; my understanding is that RancherOS uses an Ubuntu kernel, although I am not 100% sure.

  • Finally, I tried three times with Ubuntu but upgraded the kernel to 5.0.0.15 before setting things up/doing the test. Each time the test worked fine without any problems.

I don't know enough about this stuff to jump to conclusions, but is it possible that there is a problem with Nautilus and the default Ubuntu 18.04 kernel? To exclude the possibility that it might be a problem with a customised kernel used by the provider, I have tried on Hetzner Cloud, UpCloud and DigitalOcean with the same result: the problem occurs with the default kernel but not with 5.0.0.15.

Is anyone so kind as to try and reproduce this? Please note that, as far as I remember, I haven't seen the problem when copying small amounts of data. It always happens when I copy the 24-25 GB of data that I am trying to migrate, or sometimes when I run a benchmark on a volume with fio. Thanks a lot in advance if someone can reproduce this / look into it. :)

@vitobotta (Author) commented Jun 1, 2019

Guys... I tried again with the 5.0.0.15 kernel and it happened again :( The first test, copying the data into the volume, was fine, but then I did a backup with Velero followed by a restore, and the system became unresponsive during the restore, as usual...

@dyusupov (Contributor) commented Jun 1, 2019

@ftab commented Jun 6, 2019

I'm running into this problem as well. It's causing my Kubernetes node to flap back and forth between NotReady and Ready; containers fail to start up, and even a web browser or the system monitor locks up. The system eventually ends up with over 1,000 processes, and I think it's also preventing my VirtualBox from starting.

Currently on a bare-metal single-node master: k8s 1.14.1, Rook deployed from the release-1.0 branch with storageclass-test.yml and cluster-test.yml (except that databaseSizeMB, journalSizeMB, and osdsPerDevice were commented out).

The host is running Ubuntu 18.04.2 (currently the 4.18.0-20-generic kernel) and has 2x 10-core Xeons (20 cores, 40 threads total) with 96 GB of registered DDR4 running at 2133, plus a 1 TB 970 EVO Plus NVMe drive. Suffice it to say, it should have plenty of CPU, RAM, and I/O speed...

Edit: iostat -x 1 shows utilization going very high on the NVMe device most of the time, but almost no utilization (0-1%) on the rbd devices.

@iMartyn (Contributor) commented Aug 5, 2019

As I understand it @vitobotta if you leave it set to 0, on the latest ceph image, osd_memory_target will be 0.8*(whatever limit you set in k8s).

@vitobotta (Author) commented Aug 5, 2019

> As I understand it @vitobotta if you leave it set to 0, on the latest ceph image, osd_memory_target will be 0.8*(whatever limit you set in k8s).

OK. I repeated my usual tests many times, and while with these settings it seems more stable, I had to forcefully reset the server twice due to the system stalling with high I/O wait.

@ron1 commented Aug 5, 2019

@iMartyn @vitobotta My take on the comment linked below from JoeT on the Slack channel is that the latest Ceph image is using the wrong value for this computation.

https://rook-io.slack.com/archives/CK9CF5H2R/p1564691260223200

> ... it is a ceph bug ... instead of looking at the system cgroup memory.limit it should look at the process limit

Am I interpreting his comment correctly? Like @vitobotta, my 1.0.4/14.2.2 cluster is significantly more stable with osd_memory_target set to 0.8*(resources memory limit).

@bengland2 commented Aug 5, 2019

@ron1 commented Aug 5, 2019

@bengland2 Thanks for the link. The link includes the following comment:

> Note that because Bluestore is computing the osd_memory_target for us based on the CGroup limit, there is no way to override this, so this is a high priority problem!

Is the comment true for Rook-Ceph-based containerized deployments? Or does Rook-Ceph indeed allow the osd_memory_target override via the rook-config-override ConfigMap?

@iMartyn (Contributor) commented Aug 5, 2019

@ron1 my guess is that it is not true:

// set osd_memory_target *default* based on cgroup limit?

in ceph/ceph@fc3bdad#diff-a9faffcf40600fd57aea5451cef5abe9 indicates that it will only set the default, and the conf file (which is what the configmap is) should override it.
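One way to check which value actually wins on a running OSD (the osd id is a placeholder; the first command needs the OSD's admin socket, so run it inside the OSD pod, while the second can run from the rook-toolbox on Nautilus):

# Inside the OSD pod: the value the daemon is actually using
ceph daemon osd.0 config show | grep osd_memory_target

# From the toolbox (Nautilus): the value the central config resolves to
ceph config get osd.0 osd_memory_target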

@markhpc commented Aug 5, 2019

> Hold the phones! I think I managed to trigger it on a big fio r/w:
>
> <snip>
>
> Here's the collectl: https://public.objects.martyn.berlin/worker-20190804-170145.raw.gz?AWSAccessKeyId=BRN19F16P9HV1D6A3SNM&Expires=1565536274&Signature=zDpyKuGwOVQwnbOqUMJ60xq5Uf8%3D
>
> My gut instinct is a kernel bug but we're seeing it across different machines and one of them was at one point the latest debian and did it but currently they're all on the previous stable debian. @markhpc thoughts?

@iMartyn I don't seem to be able to download the collectl data unfortunately. Happy to take a look though!

@bengland2 commented Aug 5, 2019

No, I don't think there is a way to override osd_memory_target. I just checked by deploying Ceph. My OSD 18 has a 6-GiB memory limit as displayed by OpenShift "oc get nodes" (like kubectl get nodes), and Ceph osd_memory_target is set to

[root@e23-h21-740xd ceph]# oc rsh rook-ceph-osd-18-7db5c8c54d-lnm58
sh-4.2# ceph daemon osd.18 config show | grep osd_memory_target
    "osd_memory_target": "6442450944",
    "osd_memory_target_cgroup_limit_ratio": "0.800000",

Even though I set this config map in rook prior to deploying the OSDs:

[root@e23-h21-740xd ceph]# oc get configmap rook-ceph-config -o yaml
apiVersion: v1
data:
  ceph.conf: |+
    [global]
    mon_allow_pool_delete     = true
    mon_max_pg_per_osd        = 1000
    osd_pg_bits               = 11
    osd_pgp_bits              = 11
    osd_pool_default_size     = 1
    osd_pool_default_min_size = 1
    osd_pool_default_pg_num   = 100
    osd_pool_default_pgp_num  = 100
    rbd_default_features      = 3
    fatal_signal_handlers     = false
    osd_memory_target         = 4000000000

kind: ConfigMap
metadata:
  creationTimestamp: "2019-08-05T17:32:10Z"
  name: rook-ceph-config
  namespace: rook-ceph
  ownerReferences:
  - apiVersion: ceph.rook.io/v1
    blockOwnerDeletion: true
    kind: CephCluster
    name: rook-ceph
    uid: ec7dfcc6-b7a6-11e9-bcec-3cfdfec16e30
  resourceVersion: "116554"
  selfLink: /api/v1/namespaces/rook-ceph/configmaps/rook-ceph-config
  uid: f45a45d9-b7a6-11e9-bcec-3cfdfec16e30

@ron1 commented Aug 5, 2019

@bengland2 The Rook ConfigMap must be named "rook-config-override", not "rook-ceph-config". See https://github.com/rook/rook/blob/master/Documentation/ceph-advanced-configuration.md#kubernetes-1

@bengland2 commented Aug 5, 2019

So the YAML that I used in operator-openshift-with-csi.yaml was:

kind: ConfigMap
apiVersion: v1
metadata:
  name: rook-config-override
  namespace: rook-ceph
data:
  config: |
    [global]
    osd_memory_target=4000000000

I think it merges this config map into the other one.

@ron1 commented Aug 5, 2019

@bengland2 I am using filestore osds and not bluestore osds at the moment. Is it possible that filestore osds honor the rook-config-override osd_memory_target value and bluestore osds do not?

When I exec into the osd pod, the rook-ceph.config file referenced by process parameter "ceph-osd --conf" indeed has the rook-config-override osd_memory_target value.

@vitobotta (Author) commented Aug 5, 2019

@ron1 What kind of tests are you running with your setup? Have you had any problems at all with that setting? I am confused, as it seems it's not needed?

@bengland2 commented Aug 5, 2019

@ron1, I think it's the other way around: Bluestore OSDs honor osd_memory_target and Filestore OSDs do not. Filestore uses the Linux buffer cache and doesn't cache data in the OSD, so there is not as much memory pressure on Filestore OSDs (until the Linux buffer cache runs out of free memory). However, Bluestore OSDs need to actually control how much memory they utilize for their data, including the cache, which is process-local.

@markhpc commented Aug 5, 2019

@iMartyn Looked through the collectl data. Load peaks around 5.2 and while the disk is super busy with a big queue, it doesn't look like the same kind of big stalls we saw previously. I also noticed lower commit memory and fewer major pagefaults. I think we're still missing something.

@ron1 What bengland2 said!

@iMartyn (Contributor) commented Aug 5, 2019

@markhpc damn, I was hoping we'd caught it. I have added the override and I haven't had a crash yet, but the read-write test stomped over everything like we knew it would, so Rook on the next boot said "Oooh, new fresh disk, lemme OSD that for you", which then made it shuffle things around. I'm just waiting for the reshuffle back so that I only have one OSD (keeping things simple here). So I haven't tried any load yet; just the reshuffling is going.

@ron1 commented Aug 5, 2019

@bengland2 It makes sense that bluestore would be more sensitive to osd_memory_target than filestore. Nevertheless, using filestore, the rook-config-override is indeed working in that the osd_memory_target value is being correctly injected into the pod rook-ceph.config file in my environment.

Your previous comment led me to believe that using the rook-config-override to explicitly set osd_memory_target does NOT work at least for bluestore. Am I understanding your comment correctly?

@jbw976 jbw976 modified the milestones: v1.1, 1.1 Aug 6, 2019

@markhpc commented Aug 6, 2019

Hi Folks,

A very interesting thread from 2 days ago that sounds very familiar over on LKML:

https://lkml.org/lkml/2019/8/4/15

@bengland2 commented Aug 6, 2019

@ron1, there is a difference between having osd_memory_target set in the pod and having the Ceph code actually release memory to enforce it. Bluestore enforces the OSD memory target, but the enforcement doesn't kick in until the target has been exceeded. Unfortunately, because of the bug described in the above tracker, by that point it's too late and the CGroup limit has already been exceeded. However, there is a proposed PR in place to fix it, and this should hopefully be backported to Nautilus.

As for the lkml post, that is just one example of why Ceph manages its own memory. For filestore users, I've found that lowering dirty_{background_ratio,ratio} drastically can lower latency and increase throughput in at least some cases, and I suspect that the /sys/block/device/bdi/max_ratio could have been used to create more fairness in writeback among competing HDDs, but there is no need for that investigation with Bluestore.

HTH -ben

@markhpc commented Aug 7, 2019

If anyone happens to have a kernel with PSI available that is experiencing these stalls, that might be a next step to try and get evidence of the specific resources being contended (and it appears that we can get it per cgroup):

https://lwn.net/Articles/759658/

Edit: I have a feeling we are seeing memory related livelock with associated lack of swap space, so I'd expect we'd see high memory PSI counters during the stalls, but it would be useful to verify that.
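For anyone who wants to try this, a minimal sketch of reading the PSI counters (requires a kernel built with PSI support, roughly 4.20 or newer; the per-cgroup path is an assumption that depends on your cgroup v2 layout):

# System-wide pressure stall information
cat /proc/pressure/cpu
cat /proc/pressure/memory
cat /proc/pressure/io

# Per-cgroup (cgroup v2 only); the path below is just an example
cat /sys/fs/cgroup/kubepods.slice/memory.pressure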

@ron1 commented Aug 7, 2019

@bengland2 As long as the explicit osd_memory_target is being set in the pod and if it is set to a low enough value, would that not reduce the occurrence of the CGroup limit being exceeded as a work-around until a Nautilus backport of the tracker fix is released?

@bengland2 commented Aug 7, 2019

As I recall the CGroup limit is overriding the osd_memory_target set in the configmap, and therein lies the problem. We want osd_memory_target to be set in relation to the CGroup limit so that we can easily control it with resources/osd/limits/memory field, but we don't want it to be set equal to the Cgroup limit. The proposed fix should do that.

@ron1 commented Aug 7, 2019

@bengland2 What does "overriding" the osd_memory_target value look like?

Again, I seem to be able to successfully use the rook-config-override ConfigMap to set the osd_memory_target in the rook-ceph.config file referenced by the ceph-osd process in the OSD pod.

Are you saying that the ceph-osd process itself is ignoring the osd_memory_target value in the config file and internally setting it to the CGroup limit?

Or are you saying that you have been unable to use the rook-ceph-override ConfigMap to set the osd_memory_target in the pod rook-ceph.config?

@vitobotta (Author) commented Aug 8, 2019

Hi guys... I am finding it difficult to follow at times :D Are there any other settings etc I could try? Thanks!

@mmgaggle commented Aug 12, 2019

@bengland2 It appears the osd_memory_target is adjusted downward to 80% of the container limit by default?

osdMemoryTargetSafetyFactor float32 = 0.8

if !osd.IsFileStore {
    // As of Nautilus Ceph auto-tunes its osd_memory_target on the fly so we don't need to force it
    if !c.clusterInfo.CephVersion.IsAtLeastNautilus() && !c.resources.Limits.Memory().IsZero() {
        osdMemoryTargetValue := float32(c.resources.Limits.Memory().Value()) * osdMemoryTargetSafetyFactor
        commonArgs = append(commonArgs, fmt.Sprintf("--osd-memory-target=%f", osdMemoryTargetValue))
    }
}
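For illustration (my numbers, not from the thread): with resources.limits.memory set to 6Gi, this pre-Nautilus code path would pass roughly --osd-memory-target=5153960755 (0.8 × 6442450944 bytes). On Nautilus and later the flag is skipped entirely and Ceph's own CGroup-based auto-tuning is relied on instead.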

@bengland2 commented Aug 12, 2019

@mmgaggle yeah, but Ceph OSD had been overriding its osd_memory_target here. So it didn't matter what rook did. See tracker 41037 .

@ron1 commented Aug 12, 2019

@bengland2 Your post is the information I was looking for. Thanks. So indeed, using the rook configmap override mechanism for bluestore osd_memory_target does not work. However, I suspect it does work for filestore. If so, that would explain the results I have seen with filestore and the results others have reported for bluestore. Are you able to confirm?

@laevilgenius (Contributor) commented Aug 24, 2019

I have similar stalls and apparently they only happen when I'm using fstype: xfs.
(v1.0.0-403.gdd7a82f / v14.2.2-20190722)
