
core: report node metrics using ceph telemetry #12850

Merged

merged 1 commit into rook:master on Nov 16, 2023

Conversation

parth-gr
Member

@parth-gr parth-gr commented Sep 5, 2023

Add this reporting in the CephCluster reconcile,
similar to how we report the other telemetry keys.

Closes: #12344

Description of your changes:

Which issue is resolved by this Pull Request:
Resolves #

Checklist:

  • Commit Message Formatting: Commit titles and messages follow guidelines in the developer guide.
  • Skip Tests for Docs: If this is only a documentation change, add the label skip-ci on the PR.
  • Reviewed the developer guide on Submitting a Pull Request
  • Pending release notes updated with breaking and/or notable changes for the next minor release.
  • Documentation has been updated, if necessary.
  • Unit tests have been added, if necessary.
  • Integration tests have been added, if necessary.

pkg/operator/ceph/cluster/nodedaemon/add.go (resolved)
pkg/operator/ceph/cluster/telemetry/telemetry.go (outdated, resolved)
pkg/operator/ceph/cluster/telemetry/telemetry.go (outdated, resolved)
// Report the cephNodeCount
// abc := nodedaemon.CrashCollectorAppName
listoption := metav1.ListOptions{LabelSelector: fmt.Sprintf("%q=%q", k8sutil.AppAttr, "rook-ceph-crashcollector")}
cephNodeList, err := context.Clientset.CoreV1().Nodes().List(clusterInfo.Context, listoption)
Member

Instead of a separate call to list the nodes multiple times, just use the same nodeList that was retrieved above on line 287. You can iterate over it to evaluate the labels locally, so no separate query is necessary. One query of nodes is much more efficient than five.
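
A minimal sketch of that idea, assuming a `nodeList` already retrieved earlier in the reconcile; the helper name and the label key/value are purely illustrative, and, as the follow-up below explains, this turned out not to work because nodes do not carry per-daemon labels:

```go
// countNodesWithLabel counts nodes in an already-fetched list whose labels
// match the given key/value, avoiding an extra API query per daemon.
// Illustrative only; assumes corev1 "k8s.io/api/core/v1" is imported.
func countNodesWithLabel(nodeList *corev1.NodeList, key, value string) int {
	count := 0
	for _, node := range nodeList.Items {
		if node.Labels[key] == value {
			count++
		}
	}
	return count
}
```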

Member Author

True

Member

Have you tested this? Actually I don't think this will work. The nodes don't have labels on them for the daemons, so we likely need a separate query for each daemon to count how many pods running of each type.
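
A hedged sketch of such a per-daemon query: list the pods carrying one app label and count the distinct nodes they run on. The function name and signature are illustrative, not the PR's actual code, and the imports (`context`, `fmt`, client-go, Rook's `k8sutil`) are assumed to already be present in the file:

```go
// countNodesRunningDaemon lists the pods labeled app=<appLabel> in the given
// namespace and returns the number of distinct nodes hosting at least one of them.
func countNodesRunningDaemon(ctx context.Context, clientset kubernetes.Interface, namespace, appLabel string) (int, error) {
	opts := metav1.ListOptions{LabelSelector: fmt.Sprintf("%s=%s", k8sutil.AppAttr, appLabel)}
	pods, err := clientset.CoreV1().Pods(namespace).List(ctx, opts)
	if err != nil {
		return 0, err
	}
	nodes := map[string]struct{}{}
	for _, pod := range pods.Items {
		if pod.Spec.NodeName != "" {
			nodes[pod.Spec.NodeName] = struct{}{}
		}
	}
	return len(nodes), nil
}
```

Calling something like this once per daemon type (crash collector, csi-rbdplugin, csi-cephfsplugin, csi-nfsplugin) keeps each count to a single pod list.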

Member Author

Yes, we need to revert it; it doesn't work.

There is one more problem: how to skip this during the first reconcile of cluster creation, since it logs:

2023-09-07 14:22:12.502209 W | telemetry: failed to set telemetry key "rook/node/count/kubernetes-total". failed to set "rook/node/count/kubernetes-total" in the mon config-key store. output: Error initializing cluster client: ObjectNotFound('RADOS object not found (error calling conf_read_file)',): exit status 1


@@ -151,7 +154,7 @@ func TestCephCSIController(t *testing.T) {
 	res, err := r.Reconcile(ctx, req)
 	assert.NoError(t, err)
-	assert.False(t, res.Requeue)
+	assert.True(t, res.Requeue)
Member Author

@Madhu-1 do you have any idea about this?
It is sending a requeue signal, and changing the assertion to true fixes it.

@parth-gr parth-gr force-pushed the node-telemetry branch 3 times, most recently from d2c474e to d6d24d3 on September 7, 2023 14:38
@parth-gr
Member Author

parth-gr commented Sep 11, 2023

@travisn would checking the CephCluster status or the observed generation be the right way to gate the node telemetry?

@@ -263,3 +272,62 @@ func (r *ReconcileCSI) reconcile(request reconcile.Request) (reconcile.Result, error) {
 	return reconcileResult, nil
 }
+
+func reportNodeTelemetry(context *clusterd.Context, clusterInfo *cephclient.ClusterInfo) {
Member

Instead of creating this method during the csi reconcile, we can just implement it inside the existing reportTelemetry method. Then we know the ceph cluster connection is available. I don't see a need to keep it with csi, especially since not all these node metrics are specific to csi.
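
A minimal sketch of the shape being suggested here; the method and receiver names are assumptions based on the snippets in this thread, not the exact code that was merged:

```go
// reportTelemetry runs during the CephCluster reconcile, after the connection
// to the Ceph cluster has been established, so the mon config-key store is
// reachable and the ObjectNotFound error from the first reconcile is avoided.
func (c *cluster) reportTelemetry() {
	// ... existing rook/version, mon, and storage telemetry keys ...

	// Node telemetry lives here rather than in the CSI reconcile: the node
	// counts are not CSI-specific, and the Ceph connection is known to be good.
	c.reportNodeTelemetry()
}
```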

@github-actions

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in two weeks if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale label Oct 11, 2023
@travisn travisn removed the stale label Oct 11, 2023
@parth-gr parth-gr force-pushed the node-telemetry branch 7 times, most recently from 74bbdfe to 8e53785 on November 8, 2023 16:20
@parth-gr parth-gr requested a review from travisn November 9, 2023 14:46
// Report the cephNodeCount
// abc := nodedaemon.CrashCollectorAppName
listoption := metav1.ListOptions{LabelSelector: fmt.Sprintf("%s=%s", k8sutil.AppAttr, "rook-ceph-crashcollector")}
cephNodeList, err := c.context.Clientset.CoreV1().Nodes().List(c.ClusterInfo.Context, listoption)
Member

What about this comment? I thought we couldn't query the nodes like this to get the daemon count.

Member Author

Addressed

}

// Report the csi rbd node count
listoption = metav1.ListOptions{LabelSelector: fmt.Sprintf("%s=%s", k8sutil.AppAttr, csi.CsiRBDProvisioner)}
Member

We need the rbd plugin, not the provisioner. There are only two provisioner pods, but the volume plugin would be with the daemonset on (most) nodes.

Suggested change:
-listoption = metav1.ListOptions{LabelSelector: fmt.Sprintf("%s=%s", k8sutil.AppAttr, csi.CsiRBDProvisioner)}
+listoption = metav1.ListOptions{LabelSelector: fmt.Sprintf("%s=%s", k8sutil.AppAttr, csi.CsiRBDPlugin)}

Member Author

Updated

pkg/operator/ceph/cluster/cluster.go (outdated, resolved)
pkg/operator/ceph/cluster/cluster.go (outdated, resolved)
@parth-gr
Member Author

Testing

[rider@localhost examples]$ kubectl logs rook-ceph-operator-6697f899cf-c2znm -nrook-ceph  | grep tele
2023-11-14 11:12:08.031420 I | ceph-cluster-controller: reporting cluster telemetry
2023-11-14 11:12:08.849005 D | telemetry: set telemetry key: rook/version=v1.12.0-alpha.0.441.g765832b08
2023-11-14 11:12:09.397275 D | telemetry: set telemetry key: rook/kubernetes/version=v1.27.4
2023-11-14 11:12:09.934911 D | telemetry: set telemetry key: rook/csi/version=v3.9.0
2023-11-14 11:12:10.403118 D | telemetry: set telemetry key: rook/cluster/mon/max-id=0
2023-11-14 11:12:10.811169 D | telemetry: set telemetry key: rook/cluster/mon/count=1
2023-11-14 11:12:11.232128 D | telemetry: set telemetry key: rook/cluster/mon/allow-multiple-per-node=true
2023-11-14 11:12:11.656466 D | telemetry: set telemetry key: rook/cluster/mon/pvc/enabled=false
2023-11-14 11:12:12.027940 D | telemetry: set telemetry key: rook/cluster/mon/stretch/enabled=false
2023-11-14 11:12:12.438838 D | telemetry: set telemetry key: rook/cluster/storage/device-set/count/total=0
2023-11-14 11:12:12.852234 D | telemetry: set telemetry key: rook/cluster/storage/device-set/count/portable=0
2023-11-14 11:12:13.263211 D | telemetry: set telemetry key: rook/cluster/storage/device-set/count/non-portable=0
2023-11-14 11:12:13.670425 D | telemetry: set telemetry key: rook/cluster/network/provider=
2023-11-14 11:12:14.102317 D | telemetry: set telemetry key: rook/cluster/external-mode=false
2023-11-14 11:12:14.102439 I | ceph-cluster-controller: reporting node telemetry
2023-11-14 11:12:14.504457 D | telemetry: set telemetry key: rook/node/count/kubernetes-total=1
2023-11-14 11:12:14.940929 D | telemetry: set telemetry key: rook/node/count/with-ceph-daemons=0
2023-11-14 11:12:15.415596 D | telemetry: set telemetry key: rook/node/count/with-csi-rbd-plugin=1
2023-11-14 11:12:15.917123 D | telemetry: set telemetry key: rook/node/count/with-csi-cephfs-plugin=1
2023-11-14 11:12:16.405180 D | telemetry: set telemetry key: rook/node/count/with-csi-nfs-plugin=0

@parth-gr
Member Author

parth-gr commented Nov 14, 2023

The crash collector doesn't always get created, so should we use a different way to count the Ceph nodes?


// Report the cephNodeCount
listoption := metav1.ListOptions{LabelSelector: fmt.Sprintf("%s=%s", k8sutil.AppAttr, nodedaemon.CrashCollectorAppName)}
cephNodeList, err := c.context.Clientset.CoreV1().Pods(operatorNamespace).List(c.ClusterInfo.Context, listoption)
Member

We want the cluster namespace, not the operator namespace. The crash collectors will be in the same namespace as the cluster.

Member Author

Updated, thanks

if !kerrors.IsNotFound(err) {
logger.Warningf("failed to report the ceph node count. %v", err)
} else {
telemetry.ReportKeyValue(c.context, c.ClusterInfo, telemetry.CephNodeCount, "-1")
Member

If there is an error, let's not report it. Then the last reported value will still be set, and no need to set to -1 for an intermittent failure.

Member Author

I see. We do need to report something for this case: we can get the count from the number of crash collector pods, but if a user disables the crash collector, we will report -1 to represent "unknown".

}

// Report the cephNodeCount
if !c.Spec.CrashCollector.Disable {
Member

This looks backwards to me

Suggested change:
-if !c.Spec.CrashCollector.Disable {
+if c.Spec.CrashCollector.Disable {

Member Author

Updated
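
Putting the last few review points together, a rough sketch of the corrected logic; the helper name, the `c.Namespace` field, and the `strconv` import are assumptions, while the telemetry keys, spec fields, and list options come from the snippets above:

```go
// reportCephNodeCount reports how many nodes run Ceph daemons, using the crash
// collector pods in the cluster namespace (not the operator namespace) as a proxy.
// If the crash collector is disabled, "-1" is reported to mean "unknown".
func (c *cluster) reportCephNodeCount() {
	if c.Spec.CrashCollector.Disable {
		telemetry.ReportKeyValue(c.context, c.ClusterInfo, telemetry.CephNodeCount, "-1")
		return
	}
	opts := metav1.ListOptions{LabelSelector: fmt.Sprintf("%s=%s", k8sutil.AppAttr, nodedaemon.CrashCollectorAppName)}
	pods, err := c.context.Clientset.CoreV1().Pods(c.Namespace).List(c.ClusterInfo.Context, opts)
	if err != nil {
		// On an intermittent error, skip reporting so the last value persists.
		logger.Warningf("failed to report the ceph node count. %v", err)
		return
	}
	nodes := map[string]struct{}{}
	for _, pod := range pods.Items {
		nodes[pod.Spec.NodeName] = struct{}{}
	}
	telemetry.ReportKeyValue(c.context, c.ClusterInfo, telemetry.CephNodeCount, strconv.Itoa(len(nodes)))
}
```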

telemetry.CephFSNodeCount: "0",
telemetry.RBDNodeCount: "0",
telemetry.NFSNodeCount: "0",
telemetry.CephNodeCount: "-1",
Member

How about a unit test that covers the case where the node counts are not 0 or -1?

Member Author

Added
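
A rough sketch of what such a test could look like, using client-go's fake clientset; the test name, namespace, labels, and counting logic are illustrative and not necessarily the test that was added to the PR:

```go
package cluster

import (
	"context"
	"testing"

	"github.com/stretchr/testify/assert"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes/fake"
)

// TestCephNodeCountReported verifies that with crash collector pods on two
// nodes, the derived node count is a real value rather than 0 or -1.
func TestCephNodeCountReported(t *testing.T) {
	ctx := context.TODO()
	clientset := fake.NewSimpleClientset(
		&corev1.Pod{
			ObjectMeta: metav1.ObjectMeta{
				Name:      "rook-ceph-crashcollector-a",
				Namespace: "rook-ceph",
				Labels:    map[string]string{"app": "rook-ceph-crashcollector"},
			},
			Spec: corev1.PodSpec{NodeName: "node-a"},
		},
		&corev1.Pod{
			ObjectMeta: metav1.ObjectMeta{
				Name:      "rook-ceph-crashcollector-b",
				Namespace: "rook-ceph",
				Labels:    map[string]string{"app": "rook-ceph-crashcollector"},
			},
			Spec: corev1.PodSpec{NodeName: "node-b"},
		},
	)

	// List the crash collector pods and count the distinct nodes they run on.
	pods, err := clientset.CoreV1().Pods("rook-ceph").List(ctx,
		metav1.ListOptions{LabelSelector: "app=rook-ceph-crashcollector"})
	assert.NoError(t, err)

	nodes := map[string]struct{}{}
	for _, p := range pods.Items {
		nodes[p.Spec.NodeName] = struct{}{}
	}
	assert.Equal(t, 2, len(nodes))
}
```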

csiRBDProvisioner = "csi-rbdplugin-provisioner"
csiCephFSProvisioner = "csi-cephfsplugin-provisioner"
csiNFSProvisioner = "csi-nfsplugin-provisioner"
CsiRBDProvisioner = "csi-rbdplugin-provisioner"
Member

I believe we don't need to change these provisioner variables now

Member Author

Nope, no change is needed.

Add this reporting in the CephCluster reconcile,
similar to how we report the other telemetry keys.

Closes: rook#12344

Signed-off-by: parth-gr <paarora@redhat.com>
@travisn travisn merged commit 297e840 into rook:master Nov 16, 2023
51 checks passed
mergify bot added a commit that referenced this pull request Nov 16, 2023
core: report node metrics using ceph telemetry (backport #12850)
Development

Successfully merging this pull request may close these issues.

Implement node metrics for Ceph telemetry
3 participants