Monitoring - Prometheus with persistent storage fails to start with CrashLoopBackOff #17256

Open

zidex opened this issue Dec 28, 2018 · 25 comments

@zidex commented Dec 28, 2018

What kind of request is this (question/bug/enhancement/feature request):
bug

Environment information
rancher/rancher:v2.2.0-alpha3
k8s v1.12.3

Steps to reproduce (as few steps as possible):
Enable cluster monitoring
Enable Persistent Storage for Prometheus

Result:
Prometheus container fails to start with:

CrashLoopBackOff: Back-off 5m0s restarting failed container=prometheus pod=prometheus-cluster-monitoring-0_cattle-prometheus(1f55df92-0a83-11e9-a30c-902b34d02d5c)

Prometheus container log:

level=error ts=2018-12-28T10:52:14.870125298Z caller=main.go:625 err="opening storage failed: create dir: mkdir /prometheus/wal: permission denied"

Additional info:

Prometheus without persistent storage starts successfully
Grafana persistent storage is functional

@thxCode (Member) commented Dec 29, 2018

@zidex, what kind of storage provisioner are you using?

@zidex (Author) commented Dec 29, 2018

I use local storage with the Local Node Path plugin.

@thxCode (Member) commented Jan 4, 2019

@zidex, I cannot reproduce this. Can you describe the environment where you deployed your cluster in more detail? Is it a cloud VM, a managed Kubernetes service, or something else?

@zidex (Author) commented Jan 4, 2019

@thxCode
Steps to reproduce

Virtualbox VM with Ubuntu 16.04.5 Server x64
Docker 17.03.3

docker run -d --name rancher --restart=unless-stopped -p 2080:80 -p 2443:443 -v /srv/rancher:/var/lib/rancher rancher/rancher:v2.2.0-alpha3

Add Cluster (Custom) with default options:
Kubernetes Version: v1.12.3-rancher1-1
Network Provider: Canal
Project Network Isolation: Disabled

Add host:

docker run -d --privileged --restart=unless-stopped --net=host -v /etc/kubernetes:/etc/kubernetes -v /var/run:/var/run rancher/rancher-agent:v2.2.0-alpha3 --server https://192.168.1.198:2443 --token jhcvtmgs5vjlxcr4jrqxmrc8r9zcdwl8k8sbhcmcfkhxb9pxr7hvz5 --ca-checksum b71d14f3151aa6028e77daf86da32780e9bbdb764ffdaa59ebb843351512a4a8 --internal-address 192.168.1.198 --etcd --controlplane --worker --label monitoring=true

Add Storage Class

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: local-storage
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer

Create directory for local volumes

mkdir /mnt/disks

Create PV
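A local PersistentVolume for this StorageClass might look like the following. This is an illustrative sketch, not the exact manifest from the report: the PV name and capacity are assumptions, and the node name is a placeholder for the host that owns /mnt/disks.

apiVersion: v1
kind: PersistentVolume
metadata:
  name: prometheus-local-pv   # assumed name
spec:
  capacity:
    storage: 50Gi             # assumed size
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage
  local:
    path: /mnt/disks/prometheus
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - test1604    # placeholder node name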

Enable Cluster Monitoring with persistent storage for Prometheus enabled (other settings default)

@thxCode (Member) commented Jan 4, 2019

@zidex , FYI

Starting Prometheus Operator v0.23.2 with --disable-auto-user-group=false (used in the Rancher monitoring preview version) injects the 'correct' SecurityContext:

https://github.com/coreos/prometheus-operator/blob/27b1eb72d9d93e5ab447121e2b884bc558bde01d/pkg/prometheus/statefulset.go#L334-L340

Inside the prometheus-cluster-monitoring pod:

/prometheus $ id
uid=1000 gid=0(root) groups=2000
/prometheus $ ls -al wal
total 24188
drwxr-xr-x    2 1000     root          4096 Jan  4 08:29 .
drwxrwxrwx    3 root     root          4096 Jan  4 08:29 ..
-rw-r--r--    1 1000     root      24758420 Jan  4 08:37 00000000
/prometheus $ ls -al
total 12
drwxrwxrwx    3 root     root          4096 Jan  4 08:29 .
drwxr-xr-x    1 root     root          4096 Jan  4 08:29 ..
drwxr-xr-x    2 1000     root          4096 Jan  4 08:29 wal

In v0.26.0, however, this becomes user-configurable:

https://github.com/coreos/prometheus-operator/blob/77764ee4239f2a8a7917883651f9e92ea3c52924/pkg/prometheus/statefulset.go#L367-L370

Inside the prometheus-cluster-monitoring pod:

/prometheus $ id
uid=65534(nobody) gid=65534(nogroup)
/prometheus $ ls
wal
/prometheus $ ls -al
total 12
drwxrwxrwx    3 root     root          4096 Jan  4 09:26 .
drwxr-xr-x    1 root     root          4096 Jan  4 09:26 ..
drwxr-xr-x    2 nobody   nogroup       4096 Jan  4 09:26 wal
/prometheus $ ls -al wal
total 15844
drwxr-xr-x    2 nobody   nogroup       4096 Jan  4 09:26 .
drwxrwxrwx    3 root     root          4096 Jan  4 09:26 ..
-rw-r--r--    1 nobody   nogroup   16212770 Jan  4 09:28 00000000

So if you are using the preview version, don't upgrade the Prometheus Operator. The workaround is to drop the stale Prometheus PVC.

@thxCode (Member) commented Jan 4, 2019

Correction, I made a mistake above. I found a similar issue: coreos/prometheus-operator#830 (comment).

@thxCode (Member) commented Jan 4, 2019

@zidex, I think the kubernetes.io/no-provisioner storage provisioner doesn't respect the SecurityContext, so you have to grant the appropriate permissions manually. For your steps, you can run chown -R 1000:2000 /mnt/disks/prometheus on the test1604 host.

Once rancher/system-charts#8 is merged, we will provide a configurable SecurityContext that defaults to a non-root user with uid 1000 and gid 2000.
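On the host from the reproduction steps above, that works out to something like this sketch (run as root; assumes the PV's local path is /mnt/disks/prometheus):

# create the backing directory for the local PV and hand it to uid 1000 / gid 2000
mkdir -p /mnt/disks/prometheus
chown -R 1000:2000 /mnt/disks/prometheus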

thxCode added a commit to thxCode/rancher-system-charts that referenced this issue Jan 4, 2019

**Problem:**
- Previous charts cannot satisfy the project-level monitoring deployment design
- Grafana cannot be restarted after the password is changed
- node-exporter cannot be scheduled to `controlplane` or `etcd` role nodes
- Prometheus cannot be started with a PVC provided by storage provisioners that don't respect the `SecurityContext`

**Solution:**
- Deploy "project level" monitoring with a permission-limited Prometheus
- Remove the Grafana account `Secret` and use provisioning instead of `grafana-watch`
- Modify node-exporter `taints`
- Add a configurable `SecurityContext` for Prometheus and Alertmanager

**Issue:**
- rancher/rancher#17039
- rancher/rancher#16962
- rancher/rancher#17030
- rancher/rancher#17256

Co-authored-by: orangedeng <jxfa0043379@hotmail.com>
@zidex (Author) commented Jan 7, 2019

The directory /mnt/disks/prometheus is not created when monitoring is enabled. I created it by hand and ran chown -R 1000:2000 on it, but the problem did not go away.

I looked at the list of volumes for the prometheus-cluster-monitoring pod, and the required volume is missing; there are only config, config-out, prometheus-cluster-monitoring-rulefiles-0, and secret-exporter-etcd-cert.

If persistent storage is enabled for Grafana, the directory is created, but with incorrect permissions; after a chmod 777 everything starts working.

@thxCode (Member) commented Jan 8, 2019

@zidex, kubernetes.io/no-provisioner never creates the /prometheus subpath directory for you; that volume is managed by the Prometheus Operator. You can check the spec of the deployed Prometheus StatefulSet.

Back to the actual problem: as you can see in coreos/prometheus-operator#830 (comment), it is caused by a storage provisioner that doesn't or can't respect the SecurityContext. You can check the application image's Dockerfile to verify the runtime permissions.
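One way to inspect that StatefulSet, using the namespace and name Rancher cluster monitoring deploys:

kubectl -n cattle-prometheus get statefulset prometheus-cluster-monitoring -o yaml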

Grafana is configured as below (for reference):

securityContext:
  fsGroup: 472

Prometheus and Alertmanager will be configured as below:

securityContext:
  runAsUser: 1000
  runAsNonRoot: true
  fsGroup: 2000
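On Prometheus Operator v0.26+ these fields are set through the Prometheus custom resource, roughly as in this sketch (the resource name here is illustrative):

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: cluster-monitoring   # illustrative name
spec:
  securityContext:
    runAsUser: 1000
    runAsNonRoot: true
    fsGroup: 2000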
thxCode added a commit to thxCode/rancher-system-charts that referenced this issue Jan 21, 2019
thxCode added a commit to thxCode/rancher-system-charts that referenced this issue Jan 25, 2019
thxCode added a commit to thxCode/rancher-system-charts that referenced this issue Jan 25, 2019
thxCode added a commit to thxCode/rancher-system-charts that referenced this issue Jan 26, 2019
thxCode added a commit to thxCode/rancher-system-charts that referenced this issue Jan 27, 2019
thxCode added a commit to thxCode/rancher-system-charts that referenced this issue Jan 27, 2019
thxCode added a commit to thxCode/rancher-system-charts that referenced this issue Jan 29, 2019
thxCode added a commit to thxCode/rancher-system-charts that referenced this issue Jan 29, 2019
thxCode added a commit to thxCode/rancher-system-charts that referenced this issue Jan 29, 2019
thxCode added a commit to thxCode/rancher-system-charts that referenced this issue Feb 3, 2019
thxCode added a commit to thxCode/rancher-system-charts that referenced this issue Feb 3, 2019
cjellick added a commit to rancher/system-charts that referenced this issue Feb 12, 2019
@thxCode (Member) commented Feb 15, 2019

@zidex, is there any other problem with this?

@loganhz (Member) commented May 22, 2019

Please let us know in a comment if you can still reproduce the issue, and we'll reopen it.

loganhz closed this May 22, 2019

@danielhass commented Jul 5, 2019

@loganhz I'm still experiencing this issue. After activating Cluster Monitoring, the following errors show up in the Prometheus container log:

level=warn ts=2019-07-05T09:30:32.887785616Z caller=main.go:295 deprecation_notice="\"storage.tsdb.retention\" flag is deprecated use \"storage.tsdb.retention.time\" instead."
level=info ts=2019-07-05T09:30:32.887845114Z caller=main.go:302 msg="Starting Prometheus" version="(version=2.7.1, branch=HEAD, revision=62e591f928ddf6b3468308b7ac1de1c63aa7fcf3)"
level=info ts=2019-07-05T09:30:32.887865005Z caller=main.go:303 build_context="(go=go1.11.5, user=root@f9f82868fc43, date=20190131-11:16:59)"
level=info ts=2019-07-05T09:30:32.887883431Z caller=main.go:304 host_details="(Linux 4.4.162-94.72-default #1 SMP Mon Nov 12 18:57:45 UTC 2018 (9de753f) x86_64 prometheus-cluster-monitoring-0 (none))"
level=info ts=2019-07-05T09:30:32.887901595Z caller=main.go:305 fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2019-07-05T09:30:32.887917069Z caller=main.go:306 vm_limits="(soft=unlimited, hard=unlimited)"
level=info ts=2019-07-05T09:30:32.894063049Z caller=main.go:620 msg="Starting TSDB ..."
level=info ts=2019-07-05T09:30:32.894324467Z caller=main.go:489 msg="Stopping scrape discovery manager..."
level=info ts=2019-07-05T09:30:32.894338028Z caller=main.go:503 msg="Stopping notify discovery manager..."
level=info ts=2019-07-05T09:30:32.894343901Z caller=main.go:525 msg="Stopping scrape manager..."
level=info ts=2019-07-05T09:30:32.894351197Z caller=main.go:499 msg="Notify discovery manager stopped"
level=info ts=2019-07-05T09:30:32.894369838Z caller=main.go:485 msg="Scrape discovery manager stopped"
level=info ts=2019-07-05T09:30:32.894382328Z caller=manager.go:736 component="rule manager" msg="Stopping rule manager..."
level=info ts=2019-07-05T09:30:32.894394025Z caller=manager.go:742 component="rule manager" msg="Rule manager stopped"
level=info ts=2019-07-05T09:30:32.894402025Z caller=notifier.go:521 component=notifier msg="Stopping notification manager..."
level=info ts=2019-07-05T09:30:32.894409066Z caller=main.go:679 msg="Notifier manager stopped"
level=info ts=2019-07-05T09:30:32.894422316Z caller=main.go:519 msg="Scrape manager stopped"
level=info ts=2019-07-05T09:30:32.894454406Z caller=web.go:416 component=web msg="Start listening for connections" address=127.0.0.1:9090
level=error ts=2019-07-05T09:30:32.894921303Z caller=main.go:688 err="opening storage failed: create dir: mkdir /prometheus/wal: permission denied"

Running on Rancher v2.2.4 with custom nodes.
Prometheus container image: rancher/prom-prometheus:v2.7.1
Persistent storage for monitoring: the error happens whether it is disabled or enabled.

Any ideas on this?

@loganhz (Member) commented Jul 5, 2019

@adeleglise commented Aug 29, 2019

Hi, I'm facing the same issue, using the latest stable Rancher and monitoring chart 0.0.3.

level=warn ts=2019-08-29T09:42:00.685168906Z caller=main.go:295 deprecation_notice="\"storage.tsdb.retention\" flag is deprecated use \"storage.tsdb.retention.time\" instead."
level=info ts=2019-08-29T09:42:00.68527615Z caller=main.go:302 msg="Starting Prometheus" version="(version=2.7.1, branch=HEAD, revision=62e591f928ddf6b3468308b7ac1de1c63aa7fcf3)"
level=info ts=2019-08-29T09:42:00.685304323Z caller=main.go:303 build_context="(go=go1.11.5, user=root@f9f82868fc43, date=20190131-11:16:59)"
level=info ts=2019-08-29T09:42:00.685331767Z caller=main.go:304 host_details="(Linux 3.10.0-957.27.2.el7.x86_64 #1 SMP Mon Jul 29 17:46:05 UTC 2019 x86_64 prometheus-cluster-monitoring-0 (none))"
level=info ts=2019-08-29T09:42:00.68535965Z caller=main.go:305 fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2019-08-29T09:42:00.685383996Z caller=main.go:306 vm_limits="(soft=unlimited, hard=unlimited)"
level=info ts=2019-08-29T09:42:00.68622938Z caller=main.go:620 msg="Starting TSDB ..."
level=info ts=2019-08-29T09:42:00.686280478Z caller=web.go:416 component=web msg="Start listening for connections" address=127.0.0.1:9090
level=info ts=2019-08-29T09:42:00.686568373Z caller=main.go:489 msg="Stopping scrape discovery manager..."
level=info ts=2019-08-29T09:42:00.686585633Z caller=main.go:503 msg="Stopping notify discovery manager..."
level=info ts=2019-08-29T09:42:00.686596003Z caller=main.go:525 msg="Stopping scrape manager..."
level=info ts=2019-08-29T09:42:00.686604355Z caller=main.go:499 msg="Notify discovery manager stopped"
level=info ts=2019-08-29T09:42:00.686630871Z caller=main.go:485 msg="Scrape discovery manager stopped"
level=info ts=2019-08-29T09:42:00.686647036Z caller=manager.go:736 component="rule manager" msg="Stopping rule manager..."
level=info ts=2019-08-29T09:42:00.686664337Z caller=manager.go:742 component="rule manager" msg="Rule manager stopped"
level=info ts=2019-08-29T09:42:00.686680133Z caller=notifier.go:521 component=notifier msg="Stopping notification manager..."
level=info ts=2019-08-29T09:42:00.686693057Z caller=main.go:679 msg="Notifier manager stopped"
level=info ts=2019-08-29T09:42:00.686653567Z caller=main.go:519 msg="Scrape manager stopped"
level=error ts=2019-08-29T09:42:00.687009615Z caller=main.go:688 err="opening storage failed: create dir: mkdir /prometheus/wal: permission denied"
@jhu-arod commented Sep 10, 2019

I'm seeing the same issue with monitoring chart 0.0.3 as well. I tried manually adding write permissions for group and others, but the end result is still the same.

level=warn ts=2019-09-10T17:36:33.962589882Z caller=main.go:295 deprecation_notice="\"storage.tsdb.retention\" flag is deprecated use \"storage.tsdb.retention.time\" instead."
level=info ts=2019-09-10T17:36:33.96265445Z caller=main.go:302 msg="Starting Prometheus" version="(version=2.7.1, branch=HEAD, revision=62e591f928ddf6b3468308b7ac1de1c63aa7fcf3)"
level=info ts=2019-09-10T17:36:33.962680955Z caller=main.go:303 build_context="(go=go1.11.5, user=root@f9f82868fc43, date=20190131-11:16:59)"
level=info ts=2019-09-10T17:36:33.962710906Z caller=main.go:304 host_details="(Linux 4.14.138-114.102.amzn2.x86_64 #1 SMP Thu Aug 15 15:29:58 UTC 2019 x86_64 prometheus-cluster-monitoring-0 (none))"
level=info ts=2019-09-10T17:36:33.962738966Z caller=main.go:305 fd_limits="(soft=1024, hard=4096)"
level=info ts=2019-09-10T17:36:33.962761888Z caller=main.go:306 vm_limits="(soft=unlimited, hard=unlimited)"
level=info ts=2019-09-10T17:36:33.965308413Z caller=main.go:620 msg="Starting TSDB ..."
level=info ts=2019-09-10T17:36:33.965471525Z caller=web.go:416 component=web msg="Start listening for connections" address=127.0.0.1:9090
level=info ts=2019-09-10T17:36:33.965560432Z caller=main.go:489 msg="Stopping scrape discovery manager..."
level=info ts=2019-09-10T17:36:33.965572576Z caller=main.go:503 msg="Stopping notify discovery manager..."
level=info ts=2019-09-10T17:36:33.965581608Z caller=main.go:525 msg="Stopping scrape manager..."
level=info ts=2019-09-10T17:36:33.965591348Z caller=main.go:499 msg="Notify discovery manager stopped"
level=info ts=2019-09-10T17:36:33.965613126Z caller=main.go:519 msg="Scrape manager stopped"
level=info ts=2019-09-10T17:36:33.965628668Z caller=main.go:485 msg="Scrape discovery manager stopped"
level=info ts=2019-09-10T17:36:33.965644908Z caller=manager.go:736 component="rule manager" msg="Stopping rule manager..."
level=info ts=2019-09-10T17:36:33.965654212Z caller=manager.go:742 component="rule manager" msg="Rule manager stopped"
level=info ts=2019-09-10T17:36:33.965664281Z caller=notifier.go:521 component=notifier msg="Stopping notification manager..."
level=info ts=2019-09-10T17:36:33.965673714Z caller=main.go:679 msg="Notifier manager stopped"
level=error ts=2019-09-10T17:36:33.966146198Z caller=main.go:688 err="opening storage failed: create dir: mkdir /prometheus/wal: permission denied"
@ilusharulkov commented Sep 12, 2019

I have the same issue (monitoring version 0.0.3):

level=info ts=2019-09-12T08:01:10.622127469Z caller=main.go:302 msg="Starting Prometheus" version="(version=2.7.1, branch=HEAD, revision=62e591f928ddf6b3468308b7ac1de1c63aa7fcf3)"
level=info ts=2019-09-12T08:01:10.622156309Z caller=main.go:303 build_context="(go=go1.11.5, user=root@f9f82868fc43, date=20190131-11:16:59)"
level=info ts=2019-09-12T08:01:10.622182819Z caller=main.go:304 host_details="(Linux 3.10.0-693.21.1.el7.x86_64 #1 SMP Wed Mar 7 19:03:37 UTC 2018 x86_64 prometheus-project-monitoring-0 (none))"
level=info ts=2019-09-12T08:01:10.622209901Z caller=main.go:305 fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2019-09-12T08:01:10.622234094Z caller=main.go:306 vm_limits="(soft=unlimited, hard=unlimited)"
level=info ts=2019-09-12T08:01:10.623284441Z caller=main.go:620 msg="Starting TSDB ..."
level=info ts=2019-09-12T08:01:10.623459374Z caller=web.go:416 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2019-09-12T08:01:10.623643635Z caller=main.go:489 msg="Stopping scrape discovery manager..."
level=info ts=2019-09-12T08:01:10.623660387Z caller=main.go:503 msg="Stopping notify discovery manager..."
level=info ts=2019-09-12T08:01:10.623670473Z caller=main.go:525 msg="Stopping scrape manager..."
level=info ts=2019-09-12T08:01:10.623680553Z caller=main.go:499 msg="Notify discovery manager stopped"
level=info ts=2019-09-12T08:01:10.623707417Z caller=main.go:519 msg="Scrape manager stopped"
level=info ts=2019-09-12T08:01:10.623724734Z caller=main.go:485 msg="Scrape discovery manager stopped"
level=info ts=2019-09-12T08:01:10.623740262Z caller=manager.go:736 component="rule manager" msg="Stopping rule manager..."
level=info ts=2019-09-12T08:01:10.623753898Z caller=manager.go:742 component="rule manager" msg="Rule manager stopped"
level=info ts=2019-09-12T08:01:10.623774348Z caller=notifier.go:521 component=notifier msg="Stopping notification manager..."
level=info ts=2019-09-12T08:01:10.623792951Z caller=main.go:679 msg="Notifier manager stopped"
level=error ts=2019-09-12T08:01:10.624186968Z caller=main.go:688 err="opening storage failed: create dir: mkdir /prometheus/wal: permission denied"

Grafana starts OK.
@adeleglise commented Sep 20, 2019

@loganhz Do you think someone could have a look at this one?

@thxCode (Member) commented Sep 26, 2019

@adeleglise, @ilusharulkov, @jhu-arod, have you been using the local-path storage class?

For this issue, you can run docker ps -a | grep prometheus to find the exited Prometheus container, then docker inspect that container ID to confirm the path mounted at /prometheus, for example:

"Mounts": [
 ...
            {
                "Type": "bind",
                "Source": "/var/lib/kubelet/pods/3f52e135-e00c-11e9-a6de-fa163ef1e7a9/volume-subpaths/pvc-78656acf-dfb2-11e9-a6de-fa163ef1e7a9/prometheus/1",
                "Destination": "/prometheus",
                "Mode": "",
                "RW": true,
                "Propagation": "rprivate"
            },
...
]

Make sure the source path is owned by 1000:2000:

ls /var/lib/kubelet/pods/3f52e135-e00c-11e9-a6de-fa163ef1e7a9/volume-subpaths/pvc-78656acf-dfb2-11e9-a6de-fa163ef1e7a9/prometheus/1 -al
total 12
drwxr-xr-x 3 cloud 2000 4096 Sep 26 03:23 .
drwxr-x--- 3 root  root 4096 Sep 26 03:18 ..
drwxr-xr-x 2 cloud root 4096 Sep 26 03:23 wal

I guess you are facing the same issue: rancher/local-path-provisioner#4 (comment)
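Putting the check and fix together, a sketch (the container ID and source path are placeholders; substitute the values docker inspect reports on your host):

docker ps -a | grep prometheus                              # find the exited Prometheus container
docker inspect <container-id> --format '{{json .Mounts}}'   # locate the Source bound to /prometheus
chown -R 1000:2000 <source-path>                            # match the uid/gid the pod runs with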

@Leen15 commented Sep 26, 2019

Hi, same issue here using local-path-provisioner.

@thxCode (Member) commented Sep 26, 2019

@adeleglise, @ilusharulkov, @jhu-arod, @Leen15, could you try the workaround in #14836 (comment) to configure your cluster? I don't think you need to mount the host root path; just mount the target path of rancher/local-path-provisioner (default is /opt/local-path-provisioner, see https://github.com/rancher/local-path-provisioner#configuration). For example, edit your custom cluster config:

services:
  kubelet:
    extra_args:
      containerized: "true"
    extra_binds: 
      - "/opt/local-path-provisioner:/rootfs/opt/local-path-provisioner:rshared"
@adeleglise commented Oct 17, 2019

Hi @thxCode .

I've tried this configuration in my rancher-cluster.yaml.

rke fails to modify the cluster with this error:

FATA[0135] [workerPlane] Failed to bring up Worker Plane: [Failed to verify healthcheck: Failed to check http://localhost:10248/healthz for service [kubelet] on host [REDACTED_IP]: Get http://localhost:10248/healthz: Unable to access the service on localhost:10248. The service might be still starting up. Error: ssh: rejected: connect failed (Connection refused), log: F1017 09:29:26.301140   21314 server.go:249] unable to find mount]
@granttr commented Nov 1, 2019

> Please let us know in a comment if you can still reproduce the issue, and we'll reopen it

As posted after this issue was closed, the same failure has been reproduced several times since. I've verified that it is easily reproduced with monitoring versions 0.0.4 and 0.0.5 as well.

Is this problem specific to using the local-path-provisioner storageClass for persisting the monitoring data? (could be related to rancher/local-path-provisioner#41)

Since persistence doesn't work at all when using local-path, should this ticket be reopened pending a fix for the local-path-provisioner?

Thanks

loganhz reopened this Nov 1, 2019

@adeleglise commented Nov 14, 2019

I've tried with Longhorn and can still see this bug, using version 0.0.5 of this chart and the latest version of Longhorn.

@thxCode (Member) commented Nov 15, 2019

@adeleglise, I could not reproduce this with the following steps:

  1. Set up a v2.3.2 Rancher server
  2. Bring up a v1.15.5 Kubernetes cluster from 3 existing nodes: one with all roles + two with the worker role
  3. Install the v0.6.2 Longhorn app from the System project
  4. Configure the Prometheus instance with the longhorn StorageClass while enabling cluster-level monitoring


@adeleglise commented Nov 15, 2019

@thxCode Yup, I confirm. The problem was that the old PVC was still there, still using Local Node Path. I deleted it and recreated the Prometheus pod, and it's now working with Longhorn.
