[Feature Request] Support Rancher Server Internal Metrics #20341

ukinau · 2019-05-20T18:47:45Z

What kind of request is this (question/bug/enhancement/feature request):
feature request

Idea
I want a /metrics endpoint in Rancher server that exposes Rancher's internal state, to make it easier to operate Rancher across many clusters and nodes.
When Rancher manages multiple Kubernetes clusters with many nodes (for example, more than 50 clusters and 1000 nodes), it is difficult for an operator to grasp Rancher's internal state: whether every agent has established its websocket session, how often websocket sessions are disconnected, which Rancher server owns each cluster for the user controllers, and so on. A metrics endpoint would help operators monitor Rancher's behaviour in detail.

Feature: Endpoint 
Add "https://<rancher-server>/metrics". This endpoint should return metrics in Prometheus format.
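
For readers unfamiliar with the Prometheus text format, here is a minimal sketch of such an endpoint using only Python's standard library. Rancher itself is written in Go and would use a Prometheus client library, so `render_metrics`, `MetricsHandler`, and the sample value are hypothetical names and data chosen for illustration, not Rancher's implementation.

```python
from http.server import BaseHTTPRequestHandler

# Content type of the Prometheus text exposition format, version 0.0.4.
CONTENT_TYPE = "text/plain; version=0.0.4; charset=utf-8"


def render_metrics(samples):
    """Render {(name, help_text): value} as Prometheus text-format counters."""
    lines = []
    for (name, help_text), value in sorted(samples.items()):
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} counter")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    # Hypothetical sample data; a real server would read live counters.
    samples = {
        ("total_add_websocket_session", "Total count of added websocket sessions"): 3,
    }

    def do_GET(self):
        # Only /metrics is served; everything else is a 404.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(self.samples).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", CONTENT_TYPE)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
```

To try it locally, pass `MetricsHandler` to `http.server.HTTPServer(("127.0.0.1", 8080), MetricsHandler)` and call `serve_forever()`.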

Feature: Support Metrics
The metrics I want to support are as follows:

Generic Controller Related in Norman (already in Norman: rancher/norman#202)

Total Count of Execution for each handler
Total Count of Failure for each handler

Session Manager(remotedialer) Related in Norman (PR has been submitted rancher/norman#285)

total_add_websocket_session
=> Total Count of adding websocket session
total_remove_websocket_session
=> Total Count of removing websocket session
total_add_connections
=> Total count of adding connection
total_remove_connections
=> Total count of removing connection
total_transmit_bytes
=> Total bytes transmitted
total_transmit_error_bytes
=> Total bytes of transmission errors
total_receive_bytes
=> Total bytes of receiving
total_peer_ws_attempt
=> Total count of attempt to establish websocket session to other rancher-server
total_peer_ws_connected
=> Total count of connected websocket session to other rancher-server
total_peer_ws_disconnected
=> Total count of disconnected websocket sessions from other rancher-server
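
The counters above all share one shape: a value that only ever increases, bumped from a session-manager event. The sketch below shows that pattern with a thread-safe registry in plain Python. The class `CounterRegistry` and the hook functions are hypothetical; the actual Norman change (rancher/norman#285) is written in Go and is not reproduced here.

```python
import threading
from collections import defaultdict


class CounterRegistry:
    """Thread-safe, monotonically increasing counters keyed by metric name."""

    def __init__(self):
        self._lock = threading.Lock()
        self._values = defaultdict(int)

    def inc(self, name, amount=1):
        # Counters must never decrease; reject negative increments.
        if amount < 0:
            raise ValueError("counters can only increase")
        with self._lock:
            self._values[name] += amount

    def get(self, name):
        # Unknown counters read as 0 without being created.
        with self._lock:
            return self._values.get(name, 0)


# Hypothetical hooks a session manager might call on each event.
def on_session_added(registry):
    registry.inc("total_add_websocket_session")


def on_bytes_transmitted(registry, n):
    registry.inc("total_transmit_bytes", n)
```

Keeping separate add and remove counters, rather than a single gauge of live sessions, lets an operator see churn: a steady session count can hide frequent disconnects and reconnects.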

ClusterOwner in Rancher

Which Rancher Server is owner for specific cluster

Feature: Provide control(New settings) to enable metrics endpoint or not
Enabling the metrics endpoint causes some performance overhead and memory consumption, so it is better to let the user choose whether to enable it.

ukinau · 2019-05-30T09:22:27Z

By the way, the last part of the work needed to enable Rancher internal metrics is not available as a PR yet.
That PR needs to be sent to rancher/rancher instead of Norman. If people like the idea, I will make a PR for that last part as well, which will include the following changes:

Cluster owner information like the following:

cluster_manager_cluster_owner{cluster="c-2c9z2",instance="endpoint:443",job="rancher",owner="10.42.5.46"}  1
cluster_manager_cluster_owner{cluster="c-2mf2d",instance="endpoint:443",job="rancher",owner="10.42.5.46"}  1
cluster_manager_cluster_owner{cluster="c-47jv8",instance="endpoint:443",job="rancher",owner="10.42.5.46"}  0
cluster_manager_cluster_owner{cluster="c-47jv8",instance="endpoint:443",job="rancher",owner="10.42.5.47"}  1

Add /metrics endpoint to rancher-server
Add a new handler for ClusterController and NodeController so that we can clean up metrics that belong to deleted clusters and nodes
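
To illustrate the cluster owner series and the cleanup handler described above, here is a rough Python sketch. The class `ClusterOwnerGauge` and its methods are hypothetical and only model the idea: one series per (cluster, owner) pair with value 1 for the current owner and 0 otherwise, and all series for a cluster dropped when that cluster is deleted. The `instance` and `job` labels in the sample output are attached by Prometheus at scrape time, so the sketch omits them.

```python
class ClusterOwnerGauge:
    """Models cluster_manager_cluster_owner{cluster, owner}: 1 = owner, 0 = not."""

    def __init__(self):
        self._series = {}  # (cluster, owner) -> 0 or 1

    def set_owner(self, cluster, owner):
        # Demote any previous owner of this cluster, then mark the new one.
        for key in self._series:
            if key[0] == cluster:
                self._series[key] = 0
        self._series[(cluster, owner)] = 1

    def delete_cluster(self, cluster):
        # Drop every series for a deleted cluster so stale labels are not exported.
        for key in [k for k in self._series if k[0] == cluster]:
            del self._series[key]

    def render(self):
        # Emit one text-format line per series, sorted for stable output.
        return "".join(
            f'cluster_manager_cluster_owner{{cluster="{c}",owner="{o}"}} {v}\n'
            for (c, o), v in sorted(self._series.items())
        )
```

Without the delete step, every cluster that ever existed would keep exporting series, which grows memory use and clutters queries.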

Is this worth working on?

daxmc99 · 2019-07-03T23:51:42Z

@ukinau Yes, I am actively reviewing this PR.
Please make the PR to rancher/rancher as well.

cjellick · 2019-07-16T15:55:04Z

Make sure @ibuildthecloud reviews the associated PRs before merging

cjellick · 2019-08-09T15:32:22Z

I'm going to make my best attempt at getting this merged before v2.3 ships. We are getting close to our code freeze date and it's kind of on the bubble, to be honest.

cjellick · 2019-08-09T15:35:18Z

@daxmc99 will do a detailed review of the rancher PR and then to meet our timeline, it would be great if you could address his comments by end of next week.

cjellick · 2019-08-27T19:07:37Z

Unfortunately, this isn't going to make it into v2.3, but we are planning to invest a lot more effort into scaling after v2.3, and I want to take this up then, as such detailed monitoring will be critical for that effort.

daxmc99 · 2019-10-14T17:07:48Z

PR is here #23181

cloudnautique · 2019-11-20T15:34:31Z

We should merge the PR containing all of the metrics, and add documentation on how to enable them.

We should set up a scrape config so that if monitoring is deployed into a 'local' cluster, Rancher can be scraped by that instance of Prometheus.

A Grafana dashboard should be present in the Rancher-managed Grafana instance that shows:

Rancher websocket connection information along with bandwidth.
Number of controller executions (possibly in rate format)
Error rates from the controller runs.
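
As a sketch of what "rate format" and "error rates" mean for the dashboard panels listed above, the arithmetic can be written out in Python. The function names are hypothetical; in practice a Prometheus query would do this work, and the reset handling here is only similar in spirit to how Prometheus treats counter resets.

```python
def counter_rate(prev_value, curr_value, elapsed_seconds):
    """Per-second increase of a counter between two scrapes.

    A drop in value is treated as a counter reset (for example a process
    restart), in which case the current value is taken as the increase.
    """
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    increase = curr_value - prev_value if curr_value >= prev_value else curr_value
    return increase / elapsed_seconds


def error_ratio(failure_rate, execution_rate):
    """Fraction of handler executions that failed; 0.0 when nothing ran."""
    return failure_rate / execution_rate if execution_rate > 0 else 0.0
```

For example, a handler whose execution counter rises from 100 to 160 over a 30 second scrape interval is running at 2 executions per second.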

This view should be documented so that users know what information they are looking at and how it helps in understanding Rancher's behavior.

cjellick · 2019-11-25T17:49:45Z

@dramich please make sure there is a docs issue opened for this

dramich · 2019-11-25T17:51:50Z

rancher/rancher-docs#201

dramich · 2019-12-04T16:50:30Z

Metrics have been added. Opened a new issue for the dashboard work: #24393

izaac · 2020-02-10T23:21:30Z

I've covered this in today's Rancher version 2.4 master-head (02/10/2020) commit id: 3fc6715

Single install with a static prometheus config and the CATTLE metrics not set. ✅.
HA install with service discovery prometheus config with and without CATTLE metrics set. ✅.
/metrics endpoint auth with different user roles and authorizations. ✅.
- Creating a user with User-Base access in Rancher and navigating to the /metrics endpoint with it should throw a 401.
- Creating a user with User-Base access and View Rancher Metrics in Rancher and navigating to the /metrics endpoint with it should return 200 and the metrics in the response body.
- Creating a Standard User without View Rancher Metrics and navigating to the /metrics endpoint with it should throw a 401.
- Creating a Standard User with View Rancher Metrics and navigating to the /metrics endpoint with it should return 200 and the metrics in the response body.
- Admin User should have access by default.
Enable and disable CATTLE_PROMETHEUS_METRICS env var in an HA Rancher setup. ✅.
- Checked the cattle metrics were collected and added to the graphs in Prometheus when the var was true
- Checked the default metrics were collected and added to the graphs in Prometheus when the var was false
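
The toggle tested above can be pictured as a simple environment check. This Python sketch is an assumption about the general pattern only: the variable name `CATTLE_PROMETHEUS_METRICS` and the values true and false come from the test notes, while the function `metrics_enabled`, the case-insensitive parsing, and treating an unset variable as disabled are hypothetical choices, not confirmed Rancher behavior.

```python
import os


def metrics_enabled(environ=None):
    """Return True when CATTLE_PROMETHEUS_METRICS is "true" (case-insensitive)."""
    # Accept an explicit mapping so the check can be exercised without
    # touching the real process environment.
    env = os.environ if environ is None else environ
    return env.get("CATTLE_PROMETHEUS_METRICS", "").strip().lower() == "true"
```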

Added setup and test steps details to the Internal Metrics test plan under the 2.4 test plans section.

izaac · 2020-02-11T23:11:40Z

Metrics are present in 2.4 and QA'd with the coverage described above. Dashboard will be covered in a separate issue: #24393

daxmc99 self-assigned this May 23, 2019

daxmc99 added the status/ready-for-review label May 23, 2019

deniseschannon added [zube]: Peer Review and removed status/ready-for-review labels May 28, 2019

deniseschannon mentioned this issue May 28, 2019

Support Session Manager Related Metrics rancher/norman#285

Closed

cjellick added this to the v2.3 milestone Jun 20, 2019

ukinau mentioned this issue Jul 6, 2019

Introduce metrics endpoint in Rancher and Support cluster-owner metrics #21351

Closed

daxmc99 mentioned this issue Sep 15, 2022

Document rancher metrics endpoint rancher/rancher-docs#201

Closed

cjellick self-assigned this Aug 9, 2019

cjellick removed their assignment Aug 12, 2019

cjellick added the kind/feature Issues that represent larger new pieces of functionality, not enhancements to existing functionality label Aug 19, 2019

cjellick modified the milestones: v2.3, v2.x - Backlog Aug 27, 2019

daxmc99 assigned dramich and unassigned daxmc99 Oct 14, 2019

deniseschannon modified the milestones: v2.x - Backlog, v2.4 Nov 14, 2019

dramich assigned cloudnautique Nov 14, 2019

cloudnautique removed their assignment Nov 20, 2019

dramich mentioned this issue Nov 22, 2019

Add Metrics to Rancher #23181

Merged

dramich assigned sangeethah Dec 4, 2019

dramich added the [zube]: To Test label Dec 4, 2019

zube bot removed the [zube]: Peer Review label Dec 4, 2019

dramich mentioned this issue Dec 4, 2019

Rancher metrics dashboard requirements #24393

Closed

sangeethah assigned izaac and dramich and unassigned dramich and sangeethah Dec 4, 2019

davidnuzik removed the team/az label Jan 15, 2020

izaac closed this as completed Feb 11, 2020

zube bot added [zube]: Done and removed [zube]: To Test labels Feb 11, 2020

zube bot removed the [zube]: Done label Oct 13, 2020
