core: report node metrics using ceph telemetry #12850
Conversation
Force-pushed from d1bff3a to e2afc90
pkg/operator/ceph/csi/controller.go
Outdated
// Report the cephNodeCount
// abc := nodedaemon.CrashCollectorAppName
listoption := metav1.ListOptions{LabelSelector: fmt.Sprintf("%q=%q", k8sutil.AppAttr, "rook-ceph-crashcollector")}
cephNodeList, err := context.Clientset.CoreV1().Nodes().List(clusterInfo.Context, listoption)
Instead of a separate call to list the nodes multiple times, just use the same nodeList
that was retrieved above on line 287. You can iterate over it to evaluate the labels locally, so no separate query is necessary. One query of nodes is much more efficient than five.
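The single-query shape the reviewer suggests might look like the sketch below. This is a hypothetical helper using plain structs in place of the client-go `corev1.Node` type; as it turned out later in this thread, node objects don't actually carry these daemon labels, so this is the idea rather than the final fix.

```go
package main

import "fmt"

// node is a stand-in for corev1.Node; only the labels matter here.
type node struct {
	Name   string
	Labels map[string]string
}

// countNodesWithLabel filters an already-retrieved node list locally,
// avoiding one extra API query per label we care about.
func countNodesWithLabel(nodes []node, key, value string) int {
	count := 0
	for _, n := range nodes {
		if n.Labels[key] == value {
			count++
		}
	}
	return count
}

func main() {
	nodes := []node{
		{Name: "n1", Labels: map[string]string{"app": "rook-ceph-crashcollector"}},
		{Name: "n2", Labels: map[string]string{"app": "rook-ceph-crashcollector"}},
		{Name: "n3", Labels: map[string]string{}},
	}
	fmt.Println(countNodesWithLabel(nodes, "app", "rook-ceph-crashcollector")) // prints: 2
}
```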
True
Have you tested this? Actually I don't think this will work. The nodes don't have labels on them for the daemons, so we likely need a separate query for each daemon to count how many pods running of each type.
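The per-daemon approach described here (one pod query per daemon type, then counting distinct nodes) can be sketched as follows. This is a minimal illustration with stand-in structs; the `app` label values are illustrative, not the exact selectors used in the PR.

```go
package main

import "fmt"

// pod is a stand-in for corev1.Pod with only the fields needed here.
type pod struct {
	Labels   map[string]string
	NodeName string
}

// distinctNodeCount counts how many distinct nodes run a pod whose
// "app" label matches appName -- one such evaluation per daemon type.
func distinctNodeCount(pods []pod, appName string) int {
	nodes := map[string]struct{}{}
	for _, p := range pods {
		if p.Labels["app"] == appName {
			nodes[p.NodeName] = struct{}{}
		}
	}
	return len(nodes)
}

func main() {
	pods := []pod{
		{Labels: map[string]string{"app": "csi-rbdplugin"}, NodeName: "n1"},
		{Labels: map[string]string{"app": "csi-rbdplugin"}, NodeName: "n2"},
		{Labels: map[string]string{"app": "csi-cephfsplugin"}, NodeName: "n1"},
	}
	fmt.Println(distinctNodeCount(pods, "csi-rbdplugin")) // prints: 2
}
```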
Yes, we need to revert it; it doesn't work.
There is one more problem: how do we disable this for the first reconcile of cluster creation? It logs:
2023-09-07 14:22:12.502209 W | telemetry: failed to set telemetry key "rook/node/count/kubernetes-total". failed to set "rook/node/count/kubernetes-total" in the mon config-key store. output: Error initializing cluster client: ObjectNotFound('RADOS object not found (error calling conf_read_file)',): exit status 1
Force-pushed from e2afc90 to 0dac5c3
Force-pushed from 6249e96 to 835c7ce
@@ -151,7 +154,7 @@ func TestCephCSIController(t *testing.T) {
 	res, err := r.Reconcile(ctx, req)
 	assert.NoError(t, err)
-	assert.False(t, res.Requeue)
+	assert.True(t, res.Requeue)
@Madhu-1 do you have any idea about this? It is sending a requeue signal, and changing it to true fixes it.
Force-pushed from d2c474e to d6d24d3
@travisn would checking the cephcluster status or observed generation be the right way to get node telemetry?
pkg/operator/ceph/csi/controller.go
Outdated
@@ -263,3 +272,62 @@ func (r *ReconcileCSI) reconcile(request reconcile.Request) (reconcile.Result, e
 	return reconcileResult, nil
 }

func reportNodeTelemetry(context *clusterd.Context, clusterInfo *cephclient.ClusterInfo) {
Instead of creating this method during the csi reconcile, we can just implement it inside the existing reportTelemetry method. Then we know the ceph cluster connection is available. I don't see a need to keep it with csi, especially since not all these node metrics are specific to csi.
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in two weeks if no further activity occurs. Thank you for your contributions.
Force-pushed from 74bbdfe to 8e53785
pkg/operator/ceph/cluster/cluster.go
Outdated
// Report the cephNodeCount
// abc := nodedaemon.CrashCollectorAppName
listoption := metav1.ListOptions{LabelSelector: fmt.Sprintf("%s=%s", k8sutil.AppAttr, "rook-ceph-crashcollector")}
cephNodeList, err := c.context.Clientset.CoreV1().Nodes().List(c.ClusterInfo.Context, listoption)
What about this comment? I thought we couldn't query the nodes like this to get the daemon count.
Addressed
Force-pushed from 8e53785 to f8f97c5
Force-pushed from f9d7168 to 1415f35
pkg/operator/ceph/cluster/cluster.go
Outdated
}

// Report the csi rbd node count
listoption = metav1.ListOptions{LabelSelector: fmt.Sprintf("%s=%s", k8sutil.AppAttr, csi.CsiRBDProvisioner)}
We need the rbd plugin, not the provisioner. There are only two provisioner pods, but the volume plugin would be with the daemonset on (most) nodes.
-listoption = metav1.ListOptions{LabelSelector: fmt.Sprintf("%s=%s", k8sutil.AppAttr, csi.CsiRBDProvisioner)}
+listoption = metav1.ListOptions{LabelSelector: fmt.Sprintf("%s=%s", k8sutil.AppAttr, csi.CsiRBDPlugin)}
Updated
Force-pushed from 1415f35 to 765832b
Testing
The crash collector isn't always created, so should we add a different way to count ceph nodes?
pkg/operator/ceph/cluster/cluster.go
Outdated
// Report the cephNodeCount
listoption := metav1.ListOptions{LabelSelector: fmt.Sprintf("%s=%s", k8sutil.AppAttr, nodedaemon.CrashCollectorAppName)}
cephNodeList, err := c.context.Clientset.CoreV1().Pods(operatorNamespace).List(c.ClusterInfo.Context, listoption)
We want the cluster namespace, not the operator namespace. The crash collectors will be in the same namespace as the cluster.
Updated, thanks
pkg/operator/ceph/cluster/cluster.go
Outdated
if !kerrors.IsNotFound(err) {
	logger.Warningf("failed to report the ceph node count. %v", err)
} else {
	telemetry.ReportKeyValue(c.context, c.ClusterInfo, telemetry.CephNodeCount, "-1")
If there is an error, let's not report it. Then the last reported value will still be set, and no need to set to -1 for an intermittent failure.
I see, we do need to report in this case. We can get the count from the number of crash collector pods; however, if a user disables the crash collector, we will report -1 to represent "unknown".
Force-pushed from 765832b to 3b36ae1
Force-pushed from 3b36ae1 to 754f0f6
pkg/operator/ceph/cluster/cluster.go
Outdated
}

// Report the cephNodeCount
if !c.Spec.CrashCollector.Disable {
This looks backwards to me
-if !c.Spec.CrashCollector.Disable {
+if c.Spec.CrashCollector.Disable {
Updated
telemetry.CephFSNodeCount: "0",
telemetry.RBDNodeCount: "0",
telemetry.NFSNodeCount: "0",
telemetry.CephNodeCount: "-1",
How about a unit test that checks that the node counts are not 0 or -1?
Added
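The requested check could take roughly this shape: with daemons present, none of the reported node counts should collapse to the "0"/"-1" defaults. This is an illustrative sketch; the helper name and the telemetry key strings below are hypothetical, not the exact keys asserted in the PR's unit test.

```go
package main

import "fmt"

// allCountsPopulated returns false if any reported count is still at
// a default value ("0" or "-1"), i.e. was never actually computed.
func allCountsPopulated(telemetry map[string]string) bool {
	for _, v := range telemetry {
		if v == "0" || v == "-1" {
			return false
		}
	}
	return true
}

func main() {
	// Hypothetical key names for illustration only.
	got := map[string]string{
		"rook/node/count/cephfs-csi": "3",
		"rook/node/count/rbd-csi":    "3",
		"rook/node/count/ceph-total": "3",
	}
	fmt.Println(allCountsPopulated(got)) // prints: true
}
```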
pkg/operator/ceph/csi/spec.go
Outdated
csiRBDProvisioner = "csi-rbdplugin-provisioner"
csiCephFSProvisioner = "csi-cephfsplugin-provisioner"
csiNFSProvisioner = "csi-nfsplugin-provisioner"
CsiRBDProvisioner = "csi-rbdplugin-provisioner"
I believe we don't need to change these provisioner variables now
Nope, no change needed.
Force-pushed from 754f0f6 to 87a2771
Add this reporting with the cephcluster reconcile, the same way we reported other telemetry.
Closes: rook#12344
Signed-off-by: parth-gr <paarora@redhat.com>
Force-pushed from 87a2771 to f007f2a
core: report node metrics using ceph telemetry (backport #12850)
Add this reporting with the cephcluster reconcile, the same way we reported other telemetry.
Closes: #12344