[release-3.6] Add an interface to query downgrade status #19439

fuweid · 2025-02-17T22:24:10Z

Bug report criteria

This bug report is not security related, security issues should be disclosed privately via etcd maintainers.
This is not a support request or question, support requests or questions should be raised in the etcd discussion forums.
You have read the etcd bug reporting guidelines.
Existing open issues along with etcd frequently asked questions have been checked and this is not a duplicate.

What happened?

We don't have an API to indicate when the downgrade process is complete. Cluster administrators must manually check the cluster, storage, and server versions from the member status. Once they confirm that all members have reached the target version, they can consider the downgrade process finished. If administrators do not manually cancel the downgrade and instead upgrade the members immediately after the downgrade, they may encounter an issue where the target version is never reached.
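The manual check described above can be sketched as a small helper. This is only an illustration: `MemberVersions`, `majorMinor`, and `downgradeReached` are hypothetical names, not etcd types or APIs, and the sketch assumes versions are compared at major.minor granularity.

```go
package main

import (
	"fmt"
	"strings"
)

// MemberVersions is a hypothetical record of what an administrator reads
// from each member's status; it is not an etcd type.
type MemberVersions struct {
	Endpoint       string
	ServerVersion  string // e.g. "3.5.18"
	StorageVersion string // e.g. "3.5.0"
}

// majorMinor reduces "3.6.0-alpha.0" to "3.6" so that patch levels and
// pre-release suffixes do not affect the comparison.
func majorMinor(v string) string {
	parts := strings.SplitN(v, ".", 3)
	if len(parts) < 2 {
		return v
	}
	return parts[0] + "." + parts[1]
}

// downgradeReached reports whether the cluster version and every member's
// server and storage versions match the target major.minor version.
func downgradeReached(clusterVersion, target string, members []MemberVersions) bool {
	want := majorMinor(target)
	if majorMinor(clusterVersion) != want {
		return false
	}
	for _, m := range members {
		if majorMinor(m.ServerVersion) != want || majorMinor(m.StorageVersion) != want {
			return false
		}
	}
	return true
}

func main() {
	members := []MemberVersions{
		{"http://127.0.0.1:2379", "3.5.18", "3.5.0"},
		{"http://127.0.0.1:22379", "3.6.0-alpha.0", "3.5.0"},
	}
	// The second member still runs a 3.6 binary, so the check fails.
	fmt.Println(downgradeReached("3.5.0", "3.5", members))
}
```

Only when such a check passes for every member would the administrator treat the downgrade as finished.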

For example, consider a three-member cluster:

T1: Enable the downgrade process.
T2: Downgrade all members from v3.6.0 to v3.5.0.
T3: Upgrade all members from v3.5.0 to v3.6.0.

By default, the leader cancels the downgrade process once all members have reached the target version.

etcd/server/etcdserver/server.go

Lines 2394 to 2411 in eb7607b

    
           func (s *EtcdServer) monitorDowngrade() { 
        
           	monitor := serverversion.NewMonitor(s.Logger(), NewServerVersionAdapter(s)) 
        
           	t := s.Cfg.DowngradeCheckTime 
        
           	if t == 0 { 
        
           		return 
        
           	} 
        
           	for { 
        
           		select { 
        
           		case <-time.After(t): 
        
           		case <-s.stopping: 
        
           			return 
        
           		} 
        
           		if !s.isLeader() { 
        
           			continue 
        
           		} 
        
           		monitor.CancelDowngradeIfNeeded() 
        
           	}

If the leader doesn't cancel the downgrade before T3, then after T3 all members remain at cluster version v3.5.0, and the upgrade will not complete until the administrator manually cancels the downgrade process.
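The timeline can be illustrated with a toy model. This is not etcd's implementation: the `cluster` type and its methods are hypothetical, and the model assumes, following the description above, that the cluster version stays pinned to the downgrade target for as long as the downgrade is enabled.

```go
package main

import "fmt"

// cluster is a toy model of the scenario, not an etcd type.
type cluster struct {
	version          string // cluster version, major.minor
	downgradeEnabled bool
	members          []string // each member's server version, major.minor
}

// enableDowngrade pins the cluster version to the target (T1).
func (c *cluster) enableDowngrade(target string) {
	c.downgradeEnabled = true
	c.version = target
}

// cancelDowngrade models a manual or automatic cancellation.
func (c *cluster) cancelDowngrade() {
	c.downgradeEnabled = false
}

// setMembers models restarting every member on the given version (T2, T3).
func (c *cluster) setMembers(v string) {
	for i := range c.members {
		c.members[i] = v
	}
}

// reconcile lets the cluster version follow the members, but only when no
// downgrade is in progress and all members agree on a version.
func (c *cluster) reconcile() {
	if c.downgradeEnabled || len(c.members) == 0 {
		return
	}
	for _, m := range c.members {
		if m != c.members[0] {
			return
		}
	}
	c.version = c.members[0]
}

func main() {
	c := &cluster{version: "3.6", members: []string{"3.6", "3.6", "3.6"}}
	c.enableDowngrade("3.5") // T1
	c.setMembers("3.5")      // T2: leader has not cancelled yet
	c.setMembers("3.6")      // T3
	c.reconcile()
	fmt.Println("after T3:", c.version)

	c.cancelDowngrade()
	c.reconcile()
	fmt.Println("after cancel:", c.version)
}
```

In this model the cluster version only moves back to 3.6 once the downgrade is cancelled, which mirrors the stuck state seen in the robustness test.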

This scenario (upgrading immediately after a downgrade) is uncommon in real-world use, but we encountered it in a robustness test case (#19306).

Maybe we should consider removing that monitor and requiring the administrator to cancel the downgrade process once it has finished.

ping @ahrtr @siyuanfoundation @serathius @wenjiaswe

What did you expect to happen?

After T3, all members can upgrade to v3.6.0.

How can we reproduce it (as minimally and precisely as possible)?

T0: Create a three-member cluster.
T1: Enable the downgrade process.
T2: Downgrade all members from v3.6.0 to v3.5.0.
T3: Upgrade all members from v3.5.0 to v3.6.0.

Anything else we need to know?

No response

Etcd version (please run commands below)

$ etcd --version
# paste output here

$ etcdctl version
# paste output here

Etcd configuration (command line flags or environment variables)

paste your configuration here

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

$ etcdctl member list -w table
# paste output here

$ etcdctl --endpoints=<member list> endpoint status -w table
# paste output here

Relevant log output


ahrtr · 2025-02-18T09:47:00Z

Thanks @fuweid for raising this discussion.

I think it'd be better to add an interface for users to query the downgrade status anytime they want.
The etcdctl command is etcdctl downgrade status, and the output is something like below,

+------------------------+------------------+---------------+-----------------+-------------------+-------------------+
|        ENDPOINT        |        ID        |    VERSION    | STORAGE VERSION | Downgrade version | Downgrade Enabled | 
+------------------------+------------------+---------------+-----------------+-------------------+-------------------+
|  http://127.0.0.1:2379 | 8211f1d0f64f3269 | 3.5.18        |           3.5.0 |        3.5        |   true            |
| http://127.0.0.1:22379 | 91bc3c398fb3c146 | 3.6.0-alpha.0 |           3.5.0 |        3.5        |   true            |
| http://127.0.0.1:32379 | fd422379fda50e48 | 3.6.0-alpha.0 |           3.5.0 |        3.5        |   true            |
+------------------------+------------------+---------------+-----------------+-------------------+-------------------+

I was thinking of removing the AUTO Downgrade Cancellation. The original thought was that since users explicitly start/enable the downgrade process, they should explicitly stop/cancel it as well. However, we should be fine as long as we provide an interface for users to query the downgrade status, as mentioned above. AUTO Downgrade Cancellation also has a small benefit: it automatically stops/cancels the downgrade for users after completion.

Please let me know your thoughts. Thanks.

cc @fuweid @ivanvc @jmhbnz @serathius @siyuanfoundation

siyuanfoundation · 2025-02-18T18:40:04Z

I think it is better to keep the AUTO Downgrade Cancellation, because in real-world use cases most downgraded clusters would not be upgraded immediately. Adding an extra manual step just increases ops overhead and is more error-prone if that step is forgotten.
It is better to add a step that queries the downgrade status before the upgrade process; even if that step is skipped, things would most likely be fine.

fuweid · 2025-02-18T19:11:54Z

The etcdctl command is etcdctl downgrade status, and the output is something like below,

Sounds good to me. It's more useful for admins to be able to confirm that the downgrade process has already finished.

ahrtr · 2025-02-18T19:33:40Z

@fuweid Do you have bandwidth to add the downgrade query API? We need to backport it to release-3.6.

fuweid · 2025-02-18T19:35:31Z

@fuweid Do you have bandwidth to add the downgrade query API? We need to backport it to release-3.6.

Sure. Self-assigned

ahrtr · 2025-02-19T19:12:21Z

@ivanvc @jmhbnz Once this feature gets done, I think we should release v3.6.0-rc.1.

Hopefully this feature can be done this week or early next week (@fuweid I just added the label "priority/important-soon" to this feature; please feel free to let me know if you need any help), and we can release v3.6.0-rc.1 later next week or early the week after.

fuweid · 2025-02-26T20:27:31Z

Closed by #19456.

fuweid added area/documentation type/bug labels Feb 17, 2025

ahrtr added type/feature and removed type/bug labels Feb 18, 2025

fuweid self-assigned this Feb 18, 2025

henrybear327 mentioned this issue Feb 18, 2025

[test] Add e2e downgrade automatic cancellation test #19399

Open

ahrtr added the priority/important-soon label Feb 19, 2025

ahrtr changed the title ~~[v3.6.0-rc.0] Should we remove the monitor that automatically cancels the downgrade process once it's ready?~~ [release-3.6] Add an interface to query downgrade status Feb 19, 2025

fuweid closed this as completed Feb 26, 2025
