[release-3.6] Add an interface to query downgrade status #19439

Closed
4 tasks done
fuweid opened this issue Feb 17, 2025 · 7 comments
Labels: area/documentation, priority/important-soon (Must be staffed and worked on either currently, or very soon, ideally in time for the next release), type/feature

Comments

@fuweid
Member

fuweid commented Feb 17, 2025

Bug report criteria

What happened?

We don't have an API to indicate when the downgrade process is complete. Cluster administrators must manually check the cluster, storage, and server versions from the member status. Once they confirm that all members have reached the target version, they can consider the downgrade process finished. If administrators do not manually cancel the downgrade and instead upgrade the members immediately after the downgrade, they may encounter an issue where the target version is never reached.
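The manual check described above can be sketched roughly as follows. This is only an illustration: `MemberStatus`, `majorMinor`, and `DowngradeComplete` are hypothetical names, not etcd types or APIs, the sample values are made up for the example, and the cluster-version check is omitted for brevity.

```go
package main

import (
	"fmt"
	"strings"
)

// MemberStatus is a hypothetical, simplified record of the fields an
// administrator reads from each member's status; it is not an etcd type.
type MemberStatus struct {
	Endpoint       string
	ServerVersion  string // for example "3.5.18"
	StorageVersion string // for example "3.5.0"
}

// majorMinor reduces a version such as "3.6.0-alpha.0" to "3.6".
func majorMinor(v string) string {
	parts := strings.SplitN(v, ".", 3)
	if len(parts) < 2 {
		return v
	}
	return parts[0] + "." + parts[1]
}

// DowngradeComplete reports whether every member both runs and stores data
// in the target minor version (for example "3.5").
func DowngradeComplete(members []MemberStatus, target string) bool {
	if len(members) == 0 {
		return false
	}
	for _, m := range members {
		if majorMinor(m.ServerVersion) != target || majorMinor(m.StorageVersion) != target {
			return false
		}
	}
	return true
}

func main() {
	members := []MemberStatus{
		{"http://127.0.0.1:2379", "3.5.18", "3.5.0"},
		{"http://127.0.0.1:22379", "3.6.0-alpha.0", "3.5.0"},
		{"http://127.0.0.1:32379", "3.6.0-alpha.0", "3.5.0"},
	}
	// Two members still run 3.6, so the downgrade is not finished yet.
	fmt.Println(DowngradeComplete(members, "3.5"))
}
```

An administrator would have to repeat a check like this until it returns true, which is the manual polling this issue is about.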

Here is an example with a three-member cluster:

  • T1: Enable the downgrade process.
  • T2: Downgrade all members from v3.6.0 to v3.5.0.
  • T3: Upgrade all members from v3.5.0 to v3.6.0.

By default, the leader cancels the downgrade process once all members have reached the target version.

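func (s *EtcdServer) monitorDowngrade() {
	monitor := serverversion.NewMonitor(s.Logger(), NewServerVersionAdapter(s))
	t := s.Cfg.DowngradeCheckTime
	if t == 0 {
		// A zero check interval disables the monitor.
		return
	}
	for {
		// Wait for the next check interval, or exit when the server stops.
		select {
		case <-time.After(t):
		case <-s.stopping:
			return
		}
		// Only the leader cancels the downgrade.
		if !s.isLeader() {
			continue
		}
		monitor.CancelDowngradeIfNeeded()
	}
}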

If the leader does not cancel the downgrade before T3, then after T3 all members remain at cluster version v3.5.0, and the upgrade does not complete until the administrator manually cancels the downgrade process.
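To make the timing window concrete, here is a toy model of the T1 to T3 sequence. It is a sketch under stated assumptions, not etcd's implementation: the `Cluster` type and the rules in `cancelDowngradeIfNeeded` and `updateClusterVersion` are simplifications invented for illustration, assuming the cluster version stays pinned to the target while the downgrade is enabled and that cancellation only happens while every member runs the target version.

```go
package main

import "fmt"

// Version is a simplified major.minor version used only by this model.
type Version struct{ Major, Minor int }

func (v Version) Less(o Version) bool {
	return v.Major < o.Major || (v.Major == o.Major && v.Minor < o.Minor)
}

func (v Version) String() string { return fmt.Sprintf("%d.%d", v.Major, v.Minor) }

// Cluster is a toy model of the state discussed above, not an etcd type.
type Cluster struct {
	ClusterVersion   Version
	DowngradeEnabled bool
	Target           Version
	Members          []Version
}

// setMembers models restarting every member on version v.
func (c *Cluster) setMembers(v Version) {
	for i := range c.Members {
		c.Members[i] = v
	}
}

// cancelDowngradeIfNeeded models the leader's periodic check: the downgrade
// is cancelled only while every member runs the target version (assumption).
func (c *Cluster) cancelDowngradeIfNeeded() {
	if !c.DowngradeEnabled {
		return
	}
	for _, m := range c.Members {
		if m != c.Target {
			return
		}
	}
	c.DowngradeEnabled = false
}

// updateClusterVersion pins the cluster version to the target while the
// downgrade is enabled; otherwise it raises it to the lowest member version.
func (c *Cluster) updateClusterVersion() {
	if c.DowngradeEnabled {
		c.ClusterVersion = c.Target
		return
	}
	lowest := c.Members[0]
	for _, m := range c.Members[1:] {
		if m.Less(lowest) {
			lowest = m
		}
	}
	if c.ClusterVersion.Less(lowest) {
		c.ClusterVersion = lowest
	}
}

// Simulate replays T1..T3 and returns the final cluster version.
func Simulate(monitorRunsBeforeT3 bool) Version {
	v35, v36 := Version{3, 5}, Version{3, 6}
	c := &Cluster{ClusterVersion: v36, Members: []Version{v36, v36, v36}}

	// T1: enable the downgrade.
	c.DowngradeEnabled, c.Target = true, v35
	c.updateClusterVersion()

	// T2: downgrade all members; the monitor may or may not tick in time.
	c.setMembers(v35)
	if monitorRunsBeforeT3 {
		c.cancelDowngradeIfNeeded()
	}

	// T3: upgrade all members again, then let the monitor tick.
	c.setMembers(v36)
	c.cancelDowngradeIfNeeded()
	c.updateClusterVersion()
	return c.ClusterVersion
}

func main() {
	fmt.Println("monitor ran before T3:", Simulate(true))
	fmt.Println("monitor missed the window:", Simulate(false))
}
```

In this model, missing the window leaves the downgrade enabled forever, because after T3 the members no longer match the target version and the cancellation condition can never become true again.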

This scenario (upgrading immediately after a downgrade) is uncommon in real-world use, but we hit this issue in a robustness test case (#19306).

Maybe we should consider removing that monitor and requiring the administrator to cancel the downgrade process once it has finished.

ping @ahrtr @siyuanfoundation @serathius @wenjiaswe

What did you expect to happen?

After T3, all members can upgrade to v3.6.0.

How can we reproduce it (as minimally and precisely as possible)?

  • T0: Create a three-member cluster.
  • T1: Enable the downgrade process.
  • T2: Downgrade all members from v3.6.0 to v3.5.0.
  • T3: Upgrade all members from v3.5.0 to v3.6.0.

Anything else we need to know?

No response

Etcd version (please run commands below)

$ etcd --version
# paste output here

$ etcdctl version
# paste output here

Etcd configuration (command line flags or environment variables)

paste your configuration here

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

$ etcdctl member list -w table
# paste output here

$ etcdctl --endpoints=<member list> endpoint status -w table
# paste output here

Relevant log output

@ahrtr
Member

ahrtr commented Feb 18, 2025

Thanks @fuweid for raising this discussion.

I think it'd be better to add an interface for users to query the downgrade status anytime they want.
The etcdctl command would be etcdctl downgrade status, with output something like the following:

+------------------------+------------------+---------------+-----------------+-------------------+-------------------+
|        ENDPOINT        |        ID        |    VERSION    | STORAGE VERSION | DOWNGRADE VERSION | DOWNGRADE ENABLED |
+------------------------+------------------+---------------+-----------------+-------------------+-------------------+
|  http://127.0.0.1:2379 | 8211f1d0f64f3269 | 3.5.18        |           3.5.0 |        3.5        |   true            |
| http://127.0.0.1:22379 | 91bc3c398fb3c146 | 3.6.0-alpha.0 |           3.5.0 |        3.5        |   true            |
| http://127.0.0.1:32379 | fd422379fda50e48 | 3.6.0-alpha.0 |           3.5.0 |        3.5        |   true            |
+------------------------+------------------+---------------+-----------------+-------------------+-------------------+

I was thinking of removing the AUTO Downgrade Cancellation. The original reasoning was that since users explicitly start/enable the downgrade process, they should explicitly stop/cancel it as well. However, we should be fine as long as we provide an interface for users to query the downgrade status, as mentioned above. AUTO Downgrade Cancellation also has a small benefit: it automatically stops/cancels the downgrade for users after completion.

Please let me know your thoughts. Thanks.

cc @fuweid @ivanvc @jmhbnz @serathius @siyuanfoundation

@siyuanfoundation
Contributor

I think it is better to keep the AUTO Downgrade Cancellation, because in real-world use cases most downgraded clusters would not be upgraded immediately. Adding an extra manual step just increases ops overhead and is more error-prone if that step is forgotten.
It is better to add a step that queries the downgrade status before the upgrade process; even if that step is skipped, things would most likely be fine.
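A pre-upgrade gate along these lines might look like the sketch below. It assumes the status query returns one record per member with the columns proposed earlier in the thread; `DowngradeStatus`, `ErrDowngradeInProgress`, and `CheckBeforeUpgrade` are hypothetical names for illustration, not part of etcd or etcdctl.

```go
package main

import (
	"errors"
	"fmt"
)

// DowngradeStatus is a hypothetical record modeled on one row of the
// proposed downgrade status output; it is not an etcd type.
type DowngradeStatus struct {
	Endpoint         string
	DowngradeVersion string
	DowngradeEnabled bool
}

// ErrDowngradeInProgress is returned while any member still reports an
// enabled downgrade.
var ErrDowngradeInProgress = errors.New("downgrade still enabled")

// CheckBeforeUpgrade returns an error if any member reports that the
// downgrade is still enabled, so the operator can cancel it first.
func CheckBeforeUpgrade(rows []DowngradeStatus) error {
	for _, r := range rows {
		if r.DowngradeEnabled {
			return fmt.Errorf("%w on %s (target %s); cancel it before upgrading",
				ErrDowngradeInProgress, r.Endpoint, r.DowngradeVersion)
		}
	}
	return nil
}

func main() {
	rows := []DowngradeStatus{
		{Endpoint: "http://127.0.0.1:2379", DowngradeVersion: "3.5", DowngradeEnabled: true},
		{Endpoint: "http://127.0.0.1:22379", DowngradeVersion: "3.5", DowngradeEnabled: true},
	}
	if err := CheckBeforeUpgrade(rows); err != nil {
		fmt.Println("refusing to upgrade:", err)
		return
	}
	fmt.Println("no downgrade in progress; safe to start the upgrade")
}
```

Because the automatic cancellation stays in place, a check like this is a safety net rather than a required step: in the common case the downgrade has already been cancelled by the time an upgrade starts.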

@fuweid
Member Author

fuweid commented Feb 18, 2025

The etcdctl command is etcdctl downgrade status, and the output is something like below,

Sounds good to me. It's more useful for admins to be able to confirm that the downgrade process has already finished.

@ahrtr
Member

ahrtr commented Feb 18, 2025

@fuweid Do you have bandwidth to add the downgrade query API? We need to backport it to release-3.6.

fuweid self-assigned this Feb 18, 2025
@fuweid
Member Author

fuweid commented Feb 18, 2025

@fuweid Do you have bandwidth to add the downgrade query API? We need to backport it to release-3.6.

Sure. Self-assigned

ahrtr added the priority/important-soon label Feb 19, 2025
ahrtr changed the title from "[v3.6.0-rc.0] Should we remove the monitor that automatically cancels the downgrade process once it's ready?" to "[release-3.6] Add an interface to query downgrade status" Feb 19, 2025
@ahrtr
Member

ahrtr commented Feb 19, 2025

@ivanvc @jmhbnz Once this feature gets done, I think we should release v3.6.0-rc.1.

Hopefully this feature can be done this week or early next week (@fuweid I just added the label "priority/important-soon" to this feature; please feel free to let me know if you need any help), and we can release v3.6.0-rc.1 late next week or early the week after.

@fuweid
Member Author

fuweid commented Feb 26, 2025

Closed by #19456.

fuweid closed this as completed Feb 26, 2025