
core: report node metrics using ceph telemetry #12850

Merged

merged 1 commit into rook:master on Nov 16, 2023

Conversation

parth-gr
Member

@parth-gr parth-gr commented Sep 5, 2023

Add this reporting in the CephCluster reconcile,
similar to how we report the other telemetry keys.

Closes: #12344

Description of your changes:

Which issue is resolved by this Pull Request:
Resolves #

Checklist:

  • Commit Message Formatting: Commit titles and messages follow guidelines in the developer guide.
  • Skip Tests for Docs: If this is only a documentation change, add the label skip-ci on the PR.
  • Reviewed the developer guide on Submitting a Pull Request
  • Pending release notes updated with breaking and/or notable changes for the next minor release.
  • Documentation has been updated, if necessary.
  • Unit tests have been added, if necessary.
  • Integration tests have been added, if necessary.

pkg/operator/ceph/cluster/nodedaemon/add.go (resolved)
pkg/operator/ceph/cluster/telemetry/telemetry.go (outdated, resolved)
pkg/operator/ceph/cluster/telemetry/telemetry.go (outdated, resolved)
// Report the cephNodeCount
// abc := nodedaemon.CrashCollectorAppName
listoption := metav1.ListOptions{LabelSelector: fmt.Sprintf("%q=%q", k8sutil.AppAttr, "rook-ceph-crashcollector")}
cephNodeList, err := context.Clientset.CoreV1().Nodes().List(clusterInfo.Context, listoption)
Member

Instead of a separate call to list the nodes multiple times, just use the same nodeList that was retrieved above on line 287. You can iterate over it to evaluate the labels locally, so no separate query is necessary. One query of nodes is much more efficient than five.
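
A minimal sketch of that idea, assuming a `nodeList` already retrieved earlier in the reconcile; the helper name and the label key/value are purely illustrative, and, as the follow-up below explains, this turned out not to work because nodes do not carry per-daemon labels:

```go
// countNodesWithLabel counts nodes in an already-fetched list whose labels
// match the given key/value, avoiding an extra API query per daemon.
// Illustrative only; assumes corev1 "k8s.io/api/core/v1" is imported.
func countNodesWithLabel(nodeList *corev1.NodeList, key, value string) int {
	count := 0
	for _, node := range nodeList.Items {
		if node.Labels[key] == value {
			count++
		}
	}
	return count
}
```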

Member Author

True

Member

Have you tested this? Actually I don't think this will work. The nodes don't have labels on them for the daemons, so we likely need a separate query for each daemon to count how many pods running of each type.
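
A hedged sketch of such a per-daemon query: list the pods carrying one app label and count the distinct nodes they run on. The function name and signature are illustrative, not the PR's actual code, and the imports (`context`, `fmt`, client-go, Rook's `k8sutil`) are assumed to already be present in the file:

```go
// countNodesRunningDaemon lists the pods labeled app=<appLabel> in the given
// namespace and returns the number of distinct nodes hosting at least one of them.
func countNodesRunningDaemon(ctx context.Context, clientset kubernetes.Interface, namespace, appLabel string) (int, error) {
	opts := metav1.ListOptions{LabelSelector: fmt.Sprintf("%s=%s", k8sutil.AppAttr, appLabel)}
	pods, err := clientset.CoreV1().Pods(namespace).List(ctx, opts)
	if err != nil {
		return 0, err
	}
	nodes := map[string]struct{}{}
	for _, pod := range pods.Items {
		if pod.Spec.NodeName != "" {
			nodes[pod.Spec.NodeName] = struct{}{}
		}
	}
	return len(nodes), nil
}
```

Calling something like this once per daemon type (crash collector, csi-rbdplugin, csi-cephfsplugin, csi-nfsplugin) keeps each count to a single pod list.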

Member Author

Yes, we need to revert it; it doesn't work.

There is one more problem: how to skip this during the first reconcile of cluster creation, since it logs:

2023-09-07 14:22:12.502209 W | telemetry: failed to set telemetry key "rook/node/count/kubernetes-total". failed to set "rook/node/count/kubernetes-total" in the mon config-key store. output: Error initializing cluster client: ObjectNotFound('RADOS object not found (error calling conf_read_file)',): exit status 1


@@ -151,7 +154,7 @@ func TestCephCSIController(t *testing.T) {
 	res, err := r.Reconcile(ctx, req)
 	assert.NoError(t, err)
-	assert.False(t, res.Requeue)
+	assert.True(t, res.Requeue)
Member Author

@Madhu-1 do you have any idea about this?
It is sending a requeue signal, and changing the assertion to true fixes it.

@parth-gr parth-gr force-pushed the node-telemetry branch 3 times, most recently from d2c474e to d6d24d3 on September 7, 2023 14:38
@parth-gr
Member Author

parth-gr commented Sep 11, 2023

@travisn would checking the CephCluster status or the observed generation be the right way to gate the node telemetry?

@@ -263,3 +272,62 @@ func (r *ReconcileCSI) reconcile(request reconcile.Request) (reconcile.Result, error) {
 	return reconcileResult, nil
 }
+
+func reportNodeTelemetry(context *clusterd.Context, clusterInfo *cephclient.ClusterInfo) {
Member

Instead of creating this method during the csi reconcile, we can just implement it inside the existing reportTelemetry method. Then we know the ceph cluster connection is available. I don't see a need to keep it with csi, especially since not all these node metrics are specific to csi.
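
A minimal sketch of the shape being suggested here; the method and receiver names are assumptions based on the snippets in this thread, not the exact code that was merged:

```go
// reportTelemetry runs during the CephCluster reconcile, after the connection
// to the Ceph cluster has been established, so the mon config-key store is
// reachable and the ObjectNotFound error from the first reconcile is avoided.
func (c *cluster) reportTelemetry() {
	// ... existing rook/version, mon, and storage telemetry keys ...

	// Node telemetry lives here rather than in the CSI reconcile: the node
	// counts are not CSI-specific, and the Ceph connection is known to be good.
	c.reportNodeTelemetry()
}
```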

@github-actions

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in two weeks if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale label Oct 11, 2023
@travisn travisn removed the stale label Oct 11, 2023
@parth-gr parth-gr force-pushed the node-telemetry branch 7 times, most recently from 74bbdfe to 8e53785 on November 8, 2023 16:20
@parth-gr parth-gr requested a review from travisn November 9, 2023 14:46
// Report the cephNodeCount
// abc := nodedaemon.CrashCollectorAppName
listoption := metav1.ListOptions{LabelSelector: fmt.Sprintf("%s=%s", k8sutil.AppAttr, "rook-ceph-crashcollector")}
cephNodeList, err := c.context.Clientset.CoreV1().Nodes().List(c.ClusterInfo.Context, listoption)
Member

What about this comment? I thought we couldn't query the nodes like this to get the daemon count.

Member Author

Addressed

}

// Report the csi rbd node count
listoption = metav1.ListOptions{LabelSelector: fmt.Sprintf("%s=%s", k8sutil.AppAttr, csi.CsiRBDProvisioner)}
Member

We need the rbd plugin, not the provisioner. There are only two provisioner pods, but the volume plugin would be with the daemonset on (most) nodes.

Suggested change:
-listoption = metav1.ListOptions{LabelSelector: fmt.Sprintf("%s=%s", k8sutil.AppAttr, csi.CsiRBDProvisioner)}
+listoption = metav1.ListOptions{LabelSelector: fmt.Sprintf("%s=%s", k8sutil.AppAttr, csi.CsiRBDPlugin)}

Member Author

Updated

pkg/operator/ceph/cluster/cluster.go (outdated, resolved)
pkg/operator/ceph/cluster/cluster.go (outdated, resolved)
@parth-gr
Member Author

Testing

[rider@localhost examples]$ kubectl logs rook-ceph-operator-6697f899cf-c2znm -nrook-ceph  | grep tele
2023-11-14 11:12:08.031420 I | ceph-cluster-controller: reporting cluster telemetry
2023-11-14 11:12:08.849005 D | telemetry: set telemetry key: rook/version=v1.12.0-alpha.0.441.g765832b08
2023-11-14 11:12:09.397275 D | telemetry: set telemetry key: rook/kubernetes/version=v1.27.4
2023-11-14 11:12:09.934911 D | telemetry: set telemetry key: rook/csi/version=v3.9.0
2023-11-14 11:12:10.403118 D | telemetry: set telemetry key: rook/cluster/mon/max-id=0
2023-11-14 11:12:10.811169 D | telemetry: set telemetry key: rook/cluster/mon/count=1
2023-11-14 11:12:11.232128 D | telemetry: set telemetry key: rook/cluster/mon/allow-multiple-per-node=true
2023-11-14 11:12:11.656466 D | telemetry: set telemetry key: rook/cluster/mon/pvc/enabled=false
2023-11-14 11:12:12.027940 D | telemetry: set telemetry key: rook/cluster/mon/stretch/enabled=false
2023-11-14 11:12:12.438838 D | telemetry: set telemetry key: rook/cluster/storage/device-set/count/total=0
2023-11-14 11:12:12.852234 D | telemetry: set telemetry key: rook/cluster/storage/device-set/count/portable=0
2023-11-14 11:12:13.263211 D | telemetry: set telemetry key: rook/cluster/storage/device-set/count/non-portable=0
2023-11-14 11:12:13.670425 D | telemetry: set telemetry key: rook/cluster/network/provider=
2023-11-14 11:12:14.102317 D | telemetry: set telemetry key: rook/cluster/external-mode=false
2023-11-14 11:12:14.102439 I | ceph-cluster-controller: reporting node telemetry
2023-11-14 11:12:14.504457 D | telemetry: set telemetry key: rook/node/count/kubernetes-total=1
2023-11-14 11:12:14.940929 D | telemetry: set telemetry key: rook/node/count/with-ceph-daemons=0
2023-11-14 11:12:15.415596 D | telemetry: set telemetry key: rook/node/count/with-csi-rbd-plugin=1
2023-11-14 11:12:15.917123 D | telemetry: set telemetry key: rook/node/count/with-csi-cephfs-plugin=1
2023-11-14 11:12:16.405180 D | telemetry: set telemetry key: rook/node/count/with-csi-nfs-plugin=0

@parth-gr
Member Author

parth-gr commented Nov 14, 2023

The crash collector doesn't always get created, so should we use a different way to count the Ceph nodes?


// Report the cephNodeCount
listoption := metav1.ListOptions{LabelSelector: fmt.Sprintf("%s=%s", k8sutil.AppAttr, nodedaemon.CrashCollectorAppName)}
cephNodeList, err := c.context.Clientset.CoreV1().Pods(operatorNamespace).List(c.ClusterInfo.Context, listoption)
Member

We want the cluster namespace, not the operator namespace. The crash collectors will be in the same namespace as the cluster.

Member Author

Updated, thanks

if !kerrors.IsNotFound(err) {
logger.Warningf("failed to report the ceph node count. %v", err)
} else {
telemetry.ReportKeyValue(c.context, c.ClusterInfo, telemetry.CephNodeCount, "-1")
Member

If there is an error, let's not report it. Then the last reported value will still be set, and no need to set to -1 for an intermittent failure.

Member Author

I see. We do need to report something for this case: we can get the count from the number of crash collector pods, but if a user disables the crash collector, we will report -1 to represent "unknown".

}

// Report the cephNodeCount
if !c.Spec.CrashCollector.Disable {
Member

This looks backwards to me

Suggested change:
-if !c.Spec.CrashCollector.Disable {
+if c.Spec.CrashCollector.Disable {

Member Author

Updated
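
Putting the last few review points together, a rough sketch of the corrected logic; the helper name, the `c.Namespace` field, and the `strconv` import are assumptions, while the telemetry keys, spec fields, and list options come from the snippets above:

```go
// reportCephNodeCount reports how many nodes run Ceph daemons, using the crash
// collector pods in the cluster namespace (not the operator namespace) as a proxy.
// If the crash collector is disabled, "-1" is reported to mean "unknown".
func (c *cluster) reportCephNodeCount() {
	if c.Spec.CrashCollector.Disable {
		telemetry.ReportKeyValue(c.context, c.ClusterInfo, telemetry.CephNodeCount, "-1")
		return
	}
	opts := metav1.ListOptions{LabelSelector: fmt.Sprintf("%s=%s", k8sutil.AppAttr, nodedaemon.CrashCollectorAppName)}
	pods, err := c.context.Clientset.CoreV1().Pods(c.Namespace).List(c.ClusterInfo.Context, opts)
	if err != nil {
		// On an intermittent error, skip reporting so the last value persists.
		logger.Warningf("failed to report the ceph node count. %v", err)
		return
	}
	nodes := map[string]struct{}{}
	for _, pod := range pods.Items {
		nodes[pod.Spec.NodeName] = struct{}{}
	}
	telemetry.ReportKeyValue(c.context, c.ClusterInfo, telemetry.CephNodeCount, strconv.Itoa(len(nodes)))
}
```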

telemetry.CephFSNodeCount: "0",
telemetry.RBDNodeCount: "0",
telemetry.NFSNodeCount: "0",
telemetry.CephNodeCount: "-1",
Member

How about a unit test that covers the case where the node counts are not 0 or -1?

Member Author

Added
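
A rough sketch of what such a test could look like, using client-go's fake clientset; the test name, namespace, labels, and counting logic are illustrative and not necessarily the test that was added to the PR:

```go
package cluster

import (
	"context"
	"testing"

	"github.com/stretchr/testify/assert"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes/fake"
)

// TestCephNodeCountReported verifies that with crash collector pods on two
// nodes, the derived node count is a real value rather than 0 or -1.
func TestCephNodeCountReported(t *testing.T) {
	ctx := context.TODO()
	clientset := fake.NewSimpleClientset(
		&corev1.Pod{
			ObjectMeta: metav1.ObjectMeta{
				Name:      "rook-ceph-crashcollector-a",
				Namespace: "rook-ceph",
				Labels:    map[string]string{"app": "rook-ceph-crashcollector"},
			},
			Spec: corev1.PodSpec{NodeName: "node-a"},
		},
		&corev1.Pod{
			ObjectMeta: metav1.ObjectMeta{
				Name:      "rook-ceph-crashcollector-b",
				Namespace: "rook-ceph",
				Labels:    map[string]string{"app": "rook-ceph-crashcollector"},
			},
			Spec: corev1.PodSpec{NodeName: "node-b"},
		},
	)

	// List the crash collector pods and count the distinct nodes they run on.
	pods, err := clientset.CoreV1().Pods("rook-ceph").List(ctx,
		metav1.ListOptions{LabelSelector: "app=rook-ceph-crashcollector"})
	assert.NoError(t, err)

	nodes := map[string]struct{}{}
	for _, p := range pods.Items {
		nodes[p.Spec.NodeName] = struct{}{}
	}
	assert.Equal(t, 2, len(nodes))
}
```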

csiRBDProvisioner = "csi-rbdplugin-provisioner"
csiCephFSProvisioner = "csi-cephfsplugin-provisioner"
csiNFSProvisioner = "csi-nfsplugin-provisioner"
CsiRBDProvisioner = "csi-rbdplugin-provisioner"
Member

I believe we don't need to change these provisioner variables now

Member Author

Nope, no change is needed.

Add this reporting in the CephCluster reconcile,
similar to how we report the other telemetry keys.

Closes: rook#12344

Signed-off-by: parth-gr <paarora@redhat.com>
@travisn travisn merged commit 297e840 into rook:master Nov 16, 2023
51 checks passed
mergify bot added a commit that referenced this pull request Nov 16, 2023
core: report node metrics using ceph telemetry (backport #12850)
Development

Successfully merging this pull request may close these issues.

Implement node metrics for Ceph telemetry
3 participants