[Feature Request] Support Rancher Server Internal Metrics #20341

Closed
ukinau opened this issue May 20, 2019 · 13 comments
Labels
kind/feature Issues that represent larger new pieces of functionality, not enhancements to existing functionality

Comments

@ukinau
Contributor

ukinau commented May 20, 2019

What kind of request is this (question/bug/enhancement/feature request):
feature request

Idea
I would like Rancher server to have a /metrics endpoint that exposes Rancher's internal state, to make it easier to operate Rancher across many clusters and nodes.
When Rancher manages multiple Kubernetes clusters with many nodes (for example, more than 50 clusters and 1000 nodes), it is difficult for an operator to grasp Rancher's internal situation: whether every agent has established its websocket session, how often websocket sessions are disconnected, which Rancher server owns each cluster for the user controllers, and so on. A metrics endpoint would help operators monitor Rancher's detailed behaviour.

Feature: Endpoint
Add "https://<rancher-server>/metrics". This endpoint should return metrics information in prometheus format.
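As a rough illustration of what "Prometheus format" means for the response body, here is a minimal sketch that renders one counter in the Prometheus text exposition format using only the Python standard library. The function names and the `clientkey` label are hypothetical and exist only for this example; Rancher itself is written in Go and would normally rely on a Prometheus client library rather than hand-rendered text.

```python
def escape_label_value(value: str) -> str:
    # The text format requires backslash, double quote, and newline
    # to be escaped inside label values.
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_counter(name, help_text, samples):
    """Render one counter family as exposition text.

    samples is a list of (labels_dict, value) pairs. The help text is
    emitted as-is, so it must not contain newlines in this sketch.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        if labels:
            body = ",".join(
                f'{key}="{escape_label_value(val)}"'
                for key, val in sorted(labels.items())
            )
            lines.append(f"{name}{{{body}}} {value}")
        else:
            lines.append(f"{name} {value}")
    # The exposition format expects a trailing newline after the last sample.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_counter(
        "total_add_websocket_session",
        "Total count of added websocket sessions",
        [({"clientkey": "c-2c9z2"}, 3)],  # hypothetical label name
    ))
```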


Feature: Support Metrics
The metric types I want to support are the following:

Generic controller metrics in Norman (already in Norman: rancher/norman#202)

  • Total Count of Execution for each handler
  • Total Count of Failure for each handler

Session manager (remotedialer) metrics in Norman (PR has been submitted: rancher/norman#285)

  • total_add_websocket_session
    => Total count of websocket sessions added
  • total_remove_websocket_session
    => Total count of websocket sessions removed
  • total_add_connections
    => Total count of connections added
  • total_remove_connections
    => Total count of connections removed
  • total_transmit_bytes
    => Total bytes transmitted
  • total_transmit_error_bytes
    => Total bytes of transmit errors
  • total_receive_bytes
    => Total bytes received
  • total_peer_ws_attempt
    => Total count of attempts to establish a websocket session to another rancher-server
  • total_peer_ws_connected
    => Total count of websocket sessions connected to another rancher-server
  • total_peer_ws_disconnected
    => Total count of websocket sessions disconnected from another rancher-server
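To show how monotonically increasing counters like these can still answer "how many sessions are open right now", here is a minimal sketch of a thread-safe counter set. The `SessionCounters` class is hypothetical and is not the remotedialer implementation; the metric names are the ones listed above. The derived value is only meaningful within one process lifetime, because counters restart from zero when the server restarts.

```python
import threading
from collections import Counter


class SessionCounters:
    """Hypothetical in-process counter set for the metrics listed above."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counts = Counter()

    def inc(self, name, amount=1):
        # Counters only ever go up; a decrease would break rate calculations.
        if amount < 0:
            raise ValueError("counters can only be incremented")
        with self._lock:
            self._counts[name] += amount

    def get(self, name):
        with self._lock:
            return self._counts[name]

    def active_sessions(self):
        # Sessions added minus sessions removed gives the number still open.
        with self._lock:
            return (self._counts["total_add_websocket_session"]
                    - self._counts["total_remove_websocket_session"])


if __name__ == "__main__":
    counters = SessionCounters()
    for _ in range(3):
        counters.inc("total_add_websocket_session")
    counters.inc("total_remove_websocket_session")
    print(counters.active_sessions())
```

Keeping separate add and remove counters, rather than a single gauge, also makes the churn visible: a high rate on both with a flat difference suggests agents are reconnecting frequently.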

ClusterOwner in Rancher

  • Which Rancher server is the owner of each cluster

Feature: Provide a control (new setting) to enable or disable the metrics endpoint
Enabling the metrics endpoint causes some performance overhead and extra memory consumption.
That's why it's better to give users the choice to enable or disable it.
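One way to keep the overhead close to zero when metrics are switched off is to choose a no-op recorder at startup. The sketch below assumes the toggle is the `CATTLE_PROMETHEUS_METRICS` environment variable that appears in the QA notes later in this thread, and assumes that only the value `true` (case-insensitive) enables it; `metrics_enabled` and `make_recorder` are hypothetical helper names, not Rancher functions.

```python
import os


def metrics_enabled(environ=None):
    # Assumption: only the string "true" (any case) switches metrics on.
    env = os.environ if environ is None else environ
    return env.get("CATTLE_PROMETHEUS_METRICS", "false").strip().lower() == "true"


def make_recorder(counts, environ=None):
    """Return a record(name, amount) function.

    When metrics are disabled the returned function does nothing, so the
    hot path pays only for one cheap call and no memory is consumed.
    """
    if metrics_enabled(environ):
        def record(name, amount=1):
            counts[name] = counts.get(name, 0) + amount
    else:
        def record(name, amount=1):
            pass
    return record


if __name__ == "__main__":
    counts = {}
    record = make_recorder(counts, {"CATTLE_PROMETHEUS_METRICS": "true"})
    record("total_add_connections")
    print(counts)
```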

@ukinau
Contributor Author

ukinau commented May 30, 2019

By the way, the last part of the work needed to enable Rancher internal metrics is not available as a PR yet.
That PR needs to be sent to rancher/rancher instead of norman. If people like the idea, I will open a PR for that last part as well, which will include the following changes:

  • Cluster owner information, like the following:
cluster_manager_cluster_owner{cluster="c-2c9z2",instance="endpoint:443",job="rancher",owner="10.42.5.46"}  1
cluster_manager_cluster_owner{cluster="c-2mf2d",instance="endpoint:443",job="rancher",owner="10.42.5.46"}  1
cluster_manager_cluster_owner{cluster="c-47jv8",instance="endpoint:443",job="rancher",owner="10.42.5.46"}  0
cluster_manager_cluster_owner{cluster="c-47jv8",instance="endpoint:443",job="rancher",owner="10.42.5.47"}  1
  • Add /metrics endpoint to rancher-server
  • Add new handlers for ClusterController and NodeController so that we can clean up metrics that belong to deleted clusters and nodes
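To illustrate how an operator could consume the `cluster_manager_cluster_owner` samples shown above, here is a minimal sketch that reads scraped text and reports which server currently claims each cluster (the series with value 1). The parser is deliberately simple and is an assumption of this example: it does not handle escaped quotes or `}` inside label values, which a real Prometheus parser would.

```python
import re

# Matches one sample line of the cluster owner metric.
SAMPLE_RE = re.compile(
    r'^cluster_manager_cluster_owner\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def current_owners(exposition_text):
    """Map each cluster ID to the list of servers reporting ownership (value 1)."""
    owners = {}
    for line in exposition_text.splitlines():
        match = SAMPLE_RE.match(line.strip())
        if not match:
            continue  # comments, blank lines, and other metrics are ignored
        labels = dict(LABEL_RE.findall(match.group("labels")))
        if "cluster" not in labels or "owner" not in labels:
            continue
        if float(match.group("value")) == 1:
            owners.setdefault(labels["cluster"], []).append(labels["owner"])
    return owners


if __name__ == "__main__":
    sample = (
        'cluster_manager_cluster_owner{cluster="c-47jv8",instance="endpoint:443",'
        'job="rancher",owner="10.42.5.46"}  0\n'
        'cluster_manager_cluster_owner{cluster="c-47jv8",instance="endpoint:443",'
        'job="rancher",owner="10.42.5.47"}  1\n'
    )
    print(current_owners(sample))
```

Returning a list per cluster makes it easy to alert when a cluster has no owner or more than one. It also shows why the cleanup handlers matter: without them, series for deleted clusters would keep appearing in this output.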

Is this worth working on?

@cjellick cjellick added this to the v2.3 milestone Jun 20, 2019
@daxmc99
Contributor

daxmc99 commented Jul 3, 2019

@ukinau Yes, I am actively reviewing this PR.
Please make the PR to rancher/rancher as well.

@cjellick

Make sure @ibuildthecloud reviews the associated PRs before merging

@cjellick cjellick self-assigned this Aug 9, 2019
@cjellick

cjellick commented Aug 9, 2019

I'm going to make my best attempt at getting this merged before v2.3 ships. We are getting close to our code freeze date and it's kind of on the bubble, to be honest.

@cjellick

cjellick commented Aug 9, 2019

@daxmc99 will do a detailed review of the rancher PR and then to meet our timeline, it would be great if you could address his comments by end of next week.

@cjellick cjellick removed their assignment Aug 12, 2019
@cjellick cjellick added the kind/feature Issues that represent larger new pieces of functionality, not enhancements to existing functionality label Aug 19, 2019
@cjellick

Unfortunately, this isn't going to make it into v2.3. We are planning to invest a lot more effort into scaling after v2.3, and I want to take this up then, as such detailed monitoring will be critical for that effort.

@cjellick cjellick modified the milestones: v2.3, v2.x - Backlog Aug 27, 2019
@daxmc99 daxmc99 assigned dramich and unassigned daxmc99 Oct 14, 2019
@daxmc99
Contributor

daxmc99 commented Oct 14, 2019

PR is here #23181

@cloudnautique
Contributor

We should merge the PR containing all of the metrics, and add documentation on how to enable them.

We should set up a scrape config so that if monitoring is deployed into a 'local' cluster, Rancher can be scraped by that instance of Prometheus.

A Grafana dashboard should be present in the Rancher-managed Grafana instance that shows:

  1. Rancher websocket connection information along with bandwidth.
  2. Number of controller executions (possibly in rate format)
  3. Error rates from the controller runs.

This view should be documented so that users know what information they are looking at, and how it might be useful in understanding the behavior of Rancher.
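For readers unfamiliar with "rate format", the dashboard panels above boil down to turning ever-increasing counters into per-second rates and ratios. The following is a minimal sketch of that arithmetic with a simple counter-reset rule; the function names are hypothetical, and this is a simplification of what PromQL's `rate()` does (for example, it does no extrapolation to the window boundaries).

```python
def counter_increase(samples):
    """Total increase across (timestamp_seconds, value) samples ordered by time.

    A drop between two samples is treated as a counter reset (for example
    a server restart), so the new value counts as the increase since zero.
    """
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        increase += (cur - prev) if cur >= prev else cur
    return increase


def per_second_rate(samples):
    """Average per-second rate over the sampled window, or None if undefined."""
    if len(samples) < 2:
        return None
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return None
    return counter_increase(samples) / elapsed


def error_ratio(execution_samples, failure_samples):
    """Fraction of handler executions that failed within the window."""
    executions = counter_increase(execution_samples)
    if executions == 0:
        return None
    return counter_increase(failure_samples) / executions


if __name__ == "__main__":
    executions = [(0, 100), (30, 160), (60, 10)]  # reset before the last scrape
    print(per_second_rate(executions))
```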

@cloudnautique cloudnautique removed their assignment Nov 20, 2019
@cjellick

@dramich please make sure there is a docs issue opened for this

@dramich
Contributor

dramich commented Nov 25, 2019

rancher/rancher-docs#201

@dramich
Contributor

dramich commented Dec 4, 2019

Metrics have been added. I opened a new issue for the work to be done on a dashboard: #24393

@sangeethah sangeethah assigned izaac and dramich and unassigned dramich and sangeethah Dec 4, 2019
@izaac
Contributor

izaac commented Feb 10, 2020

I've covered this in today's Rancher version 2.4 master-head (02/10/2020) commit id: 3fc6715

  • Single install with a static prometheus config and the CATTLE metrics not set. ✅.
  • HA install with service discovery prometheus config with and without CATTLE metrics set. ✅.
  • /metrics endpoint auth with different user roles and authorizations. ✅.
    • Creating a user with User-Base access in Rancher without View Rancher Metrics and navigating to the /metrics endpoint with it should return a 401.
    • Creating a user with User-Base access in Rancher with View Rancher Metrics and navigating to the /metrics endpoint with it should return 200 and the metrics in the response body.
    • Creating a Standard User without View Rancher Metrics and navigating to the /metrics endpoint with it should return a 401.
    • Creating a Standard User with View Rancher Metrics and navigating to the /metrics endpoint with it should return 200 and the metrics in the response body.
    • Admin User should have access by default.
  • Enable and disable CATTLE_PROMETHEUS_METRICS env var in an HA Rancher setup. ✅.
    • Checked the cattle metrics were collected and added to the graphs in Prometheus when the var was true
    • Checked the default metrics were collected and added to the graphs in Prometheus when the var was false
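The authorization checks above can be repeated from a script. Here is a minimal sketch that requests `/metrics` and reports the HTTP status, so that 401 and 200 can be compared per user. It assumes the endpoint accepts a Rancher API token as a Bearer token, which this thread does not confirm; the function names and the example hostname are hypothetical.

```python
import urllib.error
import urllib.request


def build_metrics_request(base_url, token=None):
    """Build a GET request for <base_url>/metrics, optionally authenticated."""
    request = urllib.request.Request(base_url.rstrip("/") + "/metrics")
    if token:
        # Assumption: the endpoint accepts an API token as a Bearer token.
        request.add_header("Authorization", "Bearer " + token)
    return request


def metrics_status(base_url, token=None, timeout=5.0):
    """Return the HTTP status code for the /metrics endpoint.

    urllib raises HTTPError for 4xx and 5xx responses, so the code is
    taken from the exception in that case (for example 401).
    """
    try:
        request = build_metrics_request(base_url, token)
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code


if __name__ == "__main__":
    # Hypothetical server name; replace with a real Rancher URL and token.
    request = build_metrics_request("https://rancher.example", "token-abc:secret")
    print(request.full_url, request.get_header("Authorization"))
```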

Added setup and test step details to the Internal Metrics test plan under the 2.4 test plans section.

@izaac
Contributor

izaac commented Feb 11, 2020

Metrics are present in 2.4 and QA'd with the coverage described above. Dashboard will be covered in a separate issue: #24393

@izaac izaac closed this as completed Feb 11, 2020
@zube zube bot removed the [zube]: Done label Oct 13, 2020
9 participants