Implement a comprehensive diagnosis tool #20217

Open
@ahrtr

Description

What would you like to be added?

Currently we have several scattered etcd diagnosis tools.

I think it would be better to deliver a comprehensive tool that integrates all of these capabilities. It should support both online and offline diagnosis:

  • Online diagnosis connects to a running etcd cluster and gathers diagnostic data.
  • Offline diagnosis analyzes etcd data directly when the etcd instance isn't running.

A couple of use cases (stories)

  • When a user raises an issue, we can ask them to run the diagnosis tool and generate a report containing all the required information, something like etcd_diagnosis_report.json, to avoid long back-and-forth communication.
  • When the etcd cluster is completely down and users need to recover it from one of the members using the --force-new-cluster flag, they need to figure out the best member to restore from. In that case, they can use the diagnosis tool to determine which member has the latest data.
  • When the cluster has run out of db space quota and is already down, users need to figure out which resources consume most of the space. They can use the diagnosis tool for that.
  • Advanced users may want to do some offline analysis and gain deeper insights; they can use the diagnosis tool for that as well.
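
To make the recovery and space-quota stories above a bit more concrete, here is a rough Go sketch of what the offline part of such a report could compute. Everything in it is an assumption for illustration, not an existing etcd API: the `MemberState`, `PrefixUsage` and `Report` types, their field names, and the idea that the tool has already extracted each member's last WAL term/index and each key's size from the data directories. It picks the member whose log is most up to date (highest term, then highest index, which is the ordering Raft uses when comparing logs) and sums stored bytes per key prefix.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
	"strings"
)

// MemberState is a hypothetical per-member summary gathered offline.
type MemberState struct {
	Name      string `json:"name"`
	LastTerm  uint64 `json:"lastTerm"`  // term of the last log entry (assumed input)
	LastIndex uint64 `json:"lastIndex"` // index of the last log entry (assumed input)
}

// PrefixUsage is the total size of all keys sharing one prefix.
type PrefixUsage struct {
	Prefix string `json:"prefix"`
	Bytes  int    `json:"bytes"`
}

// Report is a hypothetical layout for a file such as etcd_diagnosis_report.json.
type Report struct {
	Members       []MemberState `json:"members"`
	BestMember    string        `json:"bestMember"`
	SpaceByPrefix []PrefixUsage `json:"spaceByPrefix"`
}

// bestMember returns the member with the most up-to-date log:
// the highest term wins, and the highest index breaks ties.
func bestMember(ms []MemberState) (MemberState, bool) {
	if len(ms) == 0 {
		return MemberState{}, false
	}
	best := ms[0]
	for _, m := range ms[1:] {
		if m.LastTerm > best.LastTerm ||
			(m.LastTerm == best.LastTerm && m.LastIndex > best.LastIndex) {
			best = m
		}
	}
	return best, true
}

// prefixOf keeps the first depth "/"-separated segments of a key.
func prefixOf(key string, depth int) string {
	parts := strings.Split(strings.TrimPrefix(key, "/"), "/")
	if len(parts) > depth {
		parts = parts[:depth]
	}
	return "/" + strings.Join(parts, "/")
}

// spaceByPrefix sums per-key sizes by prefix, largest consumer first.
func spaceByPrefix(sizes map[string]int, depth int) []PrefixUsage {
	totals := map[string]int{}
	for key, n := range sizes {
		totals[prefixOf(key, depth)] += n
	}
	out := make([]PrefixUsage, 0, len(totals))
	for p, n := range totals {
		out = append(out, PrefixUsage{Prefix: p, Bytes: n})
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].Bytes != out[j].Bytes {
			return out[i].Bytes > out[j].Bytes
		}
		return out[i].Prefix < out[j].Prefix
	})
	return out
}

func main() {
	// Made-up sample data; a real tool would read this from each data directory.
	members := []MemberState{
		{Name: "m1", LastTerm: 4, LastIndex: 120},
		{Name: "m2", LastTerm: 5, LastIndex: 98},
		{Name: "m3", LastTerm: 5, LastIndex: 101},
	}
	sizes := map[string]int{
		"/app/config/a": 2048,
		"/app/config/b": 1024,
		"/app/leases/x": 512,
	}
	best, _ := bestMember(members)
	report := Report{Members: members, BestMember: best.Name, SpaceByPrefix: spaceByPrefix(sizes, 2)}
	out, err := json.MarshalIndent(report, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

With the sample data, m3 is selected because it shares the highest term with m2 but has the higher index. Whether term and index are the right signal, or whether the tool should also compare revisions or the consistent index, is an open design question for the actual implementation.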

We can add the comprehensive tool under https://github.com/etcd-io/etcd/tree/main/tools. Once it's done, we can deprecate the existing etcd-dump-logs and etcd-dump-db.

cc @fuweid @ivanvc @jmhbnz @serathius

Why is this needed?

To improve the diagnosis experience for users.
