
[config reload][202012] race condition introduced by cli.run_command #2696

Open
zjswhhh opened this issue Feb 28, 2023 · 2 comments

zjswhhh commented Feb 28, 2023

Description

Steps to reproduce the issue

  1. make sure your version is <= 20201231.81 (later versions do not populate the last switchover time to the DB on config reload)
  2. toggle mux mode to standby
  3. config reload
  4. show mux metrics

Describe the results you received

# show mux metrics Ethernet124
PORT         EVENT                          TIME
-----------  -----------------------------  ---------------------------
Ethernet124  linkmgrd_switch_standby_start  2023-Feb-27 21:58:54.814214
Ethernet124  orch_switch_standby_start      2023-Feb-27 21:58:54.823527
Ethernet124  orch_switch_standby_end        2023-Feb-27 21:58:54.841689
Ethernet124  linkmgrd_switch_standby_end    2023-Feb-27 21:58:54.894002
Ethernet124  xcvrd_switch_standby_start     2023-Feb-27 22:02:29.060430
Ethernet124  xcvrd_switch_standby_end       2023-Feb-27 22:02:29.066666 

linkmgrd_switch_standby_start is stale because config reload uses write_standby to trigger switchovers, which bypasses linkmgrd.

orch_switch_standby_start and orch_switch_standby_end are stale because orchagent does not update metrics on a standby -> standby transition.

linkmgrd_switch_standby_end is stale, expectedly, in versions <= 20201231.81.
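
To make the staleness above easier to spot, here is a minimal sketch that parses the `show mux metrics` rows and lists events that lag the newest timestamp. The helper names (`parse_metrics`, `stale_events`) and the 5-second threshold are assumptions for illustration only; they are not part of sonic-utilities.

```python
from datetime import datetime, timedelta

# Rows copied from the `show mux metrics Ethernet124` output in this issue.
SAMPLE = """\
PORT         EVENT                          TIME
-----------  -----------------------------  ---------------------------
Ethernet124  linkmgrd_switch_standby_start  2023-Feb-27 21:58:54.814214
Ethernet124  orch_switch_standby_start      2023-Feb-27 21:58:54.823527
Ethernet124  orch_switch_standby_end        2023-Feb-27 21:58:54.841689
Ethernet124  linkmgrd_switch_standby_end    2023-Feb-27 21:58:54.894002
Ethernet124  xcvrd_switch_standby_start     2023-Feb-27 22:02:29.060430
Ethernet124  xcvrd_switch_standby_end       2023-Feb-27 22:02:29.066666
"""


def parse_metrics(text):
    """Return {event: datetime} for every 'PORT EVENT DATE TIME' row."""
    events = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 4:
            continue  # header, separator, and blank lines have other widths
        _port, event, date, clock = fields
        try:
            events[event] = datetime.strptime(
                f"{date} {clock}", "%Y-%b-%d %H:%M:%S.%f"
            )
        except ValueError:
            continue  # not a timestamp row
    return events


def stale_events(events, max_age=timedelta(seconds=5)):
    """List events older than the newest event by more than max_age.

    The 5-second default is an arbitrary, hypothetical threshold.
    """
    if not events:
        return []
    newest = max(events.values())
    return sorted(e for e, ts in events.items() if newest - ts > max_age)


print(stale_events(parse_metrics(SAMPLE)))
```

With the sample above, only the two xcvrd events fall within the threshold; the linkmgrd and orch events are reported as stale.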

Describe the results you expected

linkmgrd_switch_standby_end should be updated.

@zjswhhh zjswhhh self-assigned this Feb 28, 2023

zjswhhh commented Feb 28, 2023

The cause is that config reload calls db_migrator with return_cmd=False.

As a result, services are started before STATE_DB has been "migrated".

sonic-utilities/config/main.py

Lines 1666 to 1683 in 6f84aae

# Migrate DB contents to latest version
db_migrator='/usr/local/bin/db_migrator.py'
if os.path.isfile(db_migrator) and os.access(db_migrator, os.X_OK):
    if namespace is None:
        command = "{} -o migrate".format(db_migrator)
    else:
        command = "{} -o migrate -n {}".format(db_migrator, namespace)
    clicommon.run_command(command, display_cmd=True)

# Re-generate the environment variable in case config_db.json was edited
update_sonic_environment()

# We first run "systemctl reset-failed" to remove the "failed"
# status from all services before we attempt to restart them
if not no_service_restart:
    _reset_failed_services()
    log.log_notice("'reload' restarting services...")
    _restart_services()
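
The ordering concern raised in this comment (a later step starting before an earlier command has finished) can be checked in isolation with a sketch like the one below. It is generic: it uses only the standard library `subprocess` module and assumes nothing about how `clicommon.run_command` behaves internally. `run_blocking` and `migrate_then_restart` are hypothetical names, and the Python one-liner stands in for the real migration command.

```python
import subprocess
import sys
import time


def run_blocking(argv):
    """Run a command, wait for it to exit, and return (returncode, finished_at)."""
    result = subprocess.run(
        argv, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True
    )
    return result.returncode, time.monotonic()


def migrate_then_restart(migrate_argv, restart):
    """Call restart() only after migrate_argv has exited successfully.

    Returns (migrated_at, restarted_at) so the ordering can be asserted.
    """
    returncode, migrated_at = run_blocking(migrate_argv)
    if returncode != 0:
        raise RuntimeError(f"migration exited with code {returncode}")
    restarted_at = time.monotonic()
    restart()
    return migrated_at, restarted_at


# Stand-in for the migration step: a short-lived child process.
fake_migrate = [sys.executable, "-c", "import time; time.sleep(0.05)"]
migrated_at, restarted_at = migrate_then_restart(
    fake_migrate, lambda: print("restarting services (placeholder)")
)
print(migrated_at <= restarted_at)
```

Logging the two timestamps around the real calls would show whether services can in fact start before the migration command returns.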


zjswhhh commented Feb 28, 2023

Discussed with Vaibhav offline. It seems db_migrator is not the cause, as it only explicitly restores a certain list of tables.
