
Feature/prometheus skipper metrics #861

Merged
merged 9 commits into from
Mar 12, 2018

Conversation

szuecs
Member

@szuecs szuecs commented Feb 21, 2018

This is a breaking change. Please make sure we only roll this out to dev, and not to alpha and beta, before all important zmon checks and alerts have replacements.

This feature enables us to have better aggregated metrics that are less complicated than before.
It should also fix the problem that a skipper-ingress pod restart loses all metrics and makes the aggregation look awkward.

I am not sure if we should add more prometheus replicas and use a service to make it more resilient to cluster updates, etc.
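One way to make it more resilient would be to put a Service in front of the Prometheus pods so queries keep working while a pod is rescheduled. A minimal sketch, assuming an `application: prometheus` label and the default port 9090 (names are illustrative, not from these manifests):

```yaml
# Hypothetical sketch: a Service in front of the Prometheus pods, so
# queries survive a single pod being rescheduled during cluster updates.
apiVersion: v1
kind: Service
metadata:
  name: prometheus
spec:
  selector:
    application: prometheus
  ports:
  - port: 80
    targetPort: 9090
```

Note that extra Prometheus replicas would each scrape independently; the Service only load-balances queries, it does not merge the data between replicas.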

spec:
  containers:
  - name: prometheus
    image: prom/prometheus:latest
Contributor

We need a compliant image here.

@mikkeloscar
Contributor

@szuecs can we switch one of the dev clusters to this branch and prepare the checks before we start rolling this into dev and further?

@szuecs
Member Author

szuecs commented Feb 22, 2018

@mikkeloscar sure!

@szuecs
Member Author

szuecs commented Feb 26, 2018

volume mount issue: kubernetes/kubernetes#2630
Prometheus uses nobody as its user inside the container, and Kubernetes does not support it.

This is a fix for the volume problem

  securityContext:
    runAsUser: 65534
    fsGroup: 65534

emptyDir: {}
securityContext:
  runAsUser: 65534
  fsGroup: 65534
Contributor

@mikkeloscar mikkeloscar Feb 26, 2018

A comment on why this is needed would be good for the future :)

I guess 65534 -> nobody?

Member Author

Yes, but I will not change it: if the nobody user ever got uid 42, this container could no longer read or write its data because of the uid mismatch (the volume owned by 65534, the user running as 42). In theory that should be handled, but better to be safe.

@szuecs
Member Author

szuecs commented Mar 1, 2018

Current state is that I found solutions for all metrics we want to provide.

- [X] http5xx_count           round(http5xx_errors_per_sec * 60, 0)
- [X] http5xx_errors_per_sec  (6) sum(rate(skipper_serve_host_duration_seconds_count{}[1m])) by (host, code)
- [X] http5xx_rate            min(round(http5xx_count * 100. / requests_count, 4), 100.) if requests_count > 0 else 0
- [X] http4xx_count           round(http4xx_errors_per_sec * 60, 0)
- [X] http4xx_errors_per_sec  (6) sum(rate(skipper_serve_host_duration_seconds_count{}[1m])) by (host, code)
- [X] http4xx_rate            min(round(http4xx_count * 100. / requests_count, 4), 100.) if requests_count > 0 else 0
- [X] requests_count          round(requests_per_sec * 60, 0)
- [X] requests_per_sec        (7) or sum_code(6) - sum(rate(skipper_serve_host_duration_seconds_count{}[1m])) by (host)
- [X] latency_ms              (5) sum(rate(skipper_serve_host_duration_seconds_sum{}[1m])) by (host) / sum(rate(skipper_serve_host_duration_seconds_count{}[1m])) by (host)
- [X] latency_p90_ms          (4) histogram_quantile(0.90, sum(rate(skipper_serve_host_duration_seconds_bucket{}[1m])) by (le, host) )
- [X] latency_p95_ms          (3) histogram_quantile(0.95, sum(rate(skipper_serve_host_duration_seconds_bucket{}[1m])) by (le, host) )
- [X] latency_p99_ms          (2) histogram_quantile(0.99, sum(rate(skipper_serve_host_duration_seconds_bucket{}[1m])) by (le, host) )
- [X] latency_p999_ms         (1) histogram_quantile(0.999, sum(rate(skipper_serve_host_duration_seconds_bucket{}[1m])) by (le, host) )
# zmon can calculate on prom data
- [X] http_error_per_sec          # http4xx_errors_per_sec + http5xx_errors_per_sec
- [X] http_error_count            # http4xx_count + http5xx_count
- [X] http_error_rate             # http4xx_rate + http5xx_rate
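The expressions above could also be precomputed as Prometheus recording rules, so zmon only has to read ready-made series. A hedged sketch (the rule names are made up, the expressions are taken from the list above):

```yaml
# Hypothetical recording rules; the record names on the left are
# illustrative, the expressions match the PromQL listed above.
groups:
- name: skipper
  rules:
  - record: host:requests_per_sec
    expr: sum(rate(skipper_serve_host_duration_seconds_count{}[1m])) by (host)
  - record: host:latency_p99_ms
    expr: histogram_quantile(0.99, sum(rate(skipper_serve_host_duration_seconds_bucket{}[1m])) by (le, host))
```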

@szuecs
Member Author

szuecs commented Mar 8, 2018

zalando/skipper#569 implemented a soft migration path that lets users expose both versions of the metrics. We can now do the soft migration from codahale to prometheus.

@szuecs szuecs force-pushed the feature/prometheus-skipper-metrics branch from e7d4bb2 to ec16fdd Compare March 9, 2018 17:19
@szuecs szuecs mentioned this pull request Mar 9, 2018
@szuecs
Member Author

szuecs commented Mar 12, 2018

👍

@@ -0,0 +1,61 @@
apiVersion: apps/v1beta1
Contributor

This should now be apps/v1 (since v1.9)

metadata:
  annotations:
  labels:
    application: prometheus
Contributor

We should have the version label here to be consistent with the other manifests. :)
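For example, something like this would match the convention (the version value here is a placeholder, not the actual tag):

```yaml
labels:
  application: prometheus
  version: v2.1.0  # placeholder; use the deployed image tag
```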

template:
  metadata:
    labels:
      application: prometheus
Contributor

Also version label here :)

@szuecs
Member Author

szuecs commented Mar 12, 2018

Comments will be fixed in a separate PR.

@mikkeloscar
Contributor

👍

@mikkeloscar mikkeloscar merged commit d534c69 into dev Mar 12, 2018
@mikkeloscar mikkeloscar deleted the feature/prometheus-skipper-metrics branch March 12, 2018 10:54