Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[fastboot] fastboot enhancement: Use warm-boot infrastructure for fast-boot #2286

Merged
merged 5 commits into from
Sep 20, 2022

Conversation

arfeigin
Copy link
Contributor

@arfeigin arfeigin commented Aug 1, 2022

This PR should be merged together with the sonic-sairedis PR (sonic-net/sonic-sairedis#1100) and sonic-buildimage PR (sonic-net/sonic-buildimage#11594).

What I did

Improve fast-reboot flow.

How I did it

Using warm-reboot infrastructure.
Clear all routes except of default routes for faster reconciliation time.

How to verify it

Community fast-reboot test, manual testing.

Previous command output (if the output of a command-line utility has changed)

New command output (if the output of a command-line utility has changed)

@arfeigin arfeigin changed the title Use warm-boot infrastructure for fast-boot [fastboot] fastboot enhancement: Use warm-boot infrastructure for fast-boot Aug 1, 2022
@vaibhavhd
Copy link
Contributor

Why is this PR in draft mode? Is this not complete?

# Clear all routes except of default routes for faster reconciliation time.
sonic-db-cli APPL_DB eval "
for _, k in ipairs(redis.call('keys', '*')) do
if string.match(k, 'ROUTE_TABLE:') and not string.match(k, 'ROUTE_TABLE:0.0.0.0/0') and not string.match(k, 'ROUTE_TABLE:::/0') then \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does this improve reconciliation time? Is that due to less routes to reconcile?

Why do we need this for fast-reboot, if this is not done for warm-reboot?

In fast-reboot case, does this mean we will have dataplane impact much sooner than the time when ASIC reset happens?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is the same as the previous implementation of "Fast-Reboot", we save only the default routes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might be missing something, can you please point me to the present logic for save only the default routes in fast-reboot case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Saving of default routes was done by the fast-reboot-dump script other than that it was saving also ARP and FDB tables. It was being saved to /host/fast-reboot/default_routes.json

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new suggested approach was tested side-by-side with the previous approach of dumping the arp/fdb/default_routes to json files and reloading them. This single change improves dataplane downtime by more than a second by itself.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverted this change, will use the current implementation of using fast-reboot-dump and filter_fdb_entries scripts.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I inspected the last change made - reintroducing the use of fast-reboot-dump and filder_fdb_entries and the implications it had on fast-reboot performance and I wanted to rest you assured that it is not needed.

The change of having fast-reboot over warm-reboot infrastructure introduces the call to backup_database which wasn't used previously for fast-reboot. This function backups FDB_TABLE and APP_DB is being dumped as well to dump.rdb so also media settings are being stored and alongside with ARPs. This is being restored in the change made in the sonic-buildimage PR which should be merged together. In this case the issue https://github.com/sonic-net/sonic-buildimage/issues/9165 won't reproduce.

The main change that the relevant data will be restored in a similar way to as it is being restored in warm-reboot scenario and not using fdb.json; arp.json; default_routes.json; media_config.json files.

If we will call backup_database function without deleting all routes except default routes will result in storing and then restoring all the routes after reset. This will affect dataplane downtime as the amount of routes increases and this is not a desired behavior.

The specific implementation wasn't explicitly described in the HLD but the use of warm-reboot infrastructure i.e. use backup_database and dump.rdb is the cause for this change as the current logic of storing arp/fdb/default_routes and media settings is not changed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't read in detail. but the explanation sound reasonable to me.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tjchadaga can you please check this PR to ensure sonic-net/sonic-buildimage#9165 will not be seen again after this new design change gets merged?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vaibhavhd - based on the design of backup and restore of app db using dump.rdb, i think media settings should be restored on time to avoid sonic-net/sonic-buildimage#9165

scripts/fast-reboot Show resolved Hide resolved
scripts/fast-reboot Outdated Show resolved Hide resolved
scripts/fast-reboot Show resolved Hide resolved
@arfeigin
Copy link
Contributor Author

/easycla

@linux-foundation-easycla
Copy link

linux-foundation-easycla bot commented Aug 11, 2022

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: arfeigin / name: Aryeh Feigin (5271e5a)

scripts/fast-reboot Show resolved Hide resolved
sonic-db-cli STATE_DB SET "FAST_REBOOT|system" "1" "EX" "180" &>/dev/null
config warm_restart enable system
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For people who does not know the background, it is really confuse here, I would suggest to add comment here to explain why fastboot requires "enable warmboot".

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, it confused me.

scripts/fast-reboot Show resolved Hide resolved
stepanblyschak
stepanblyschak previously approved these changes Sep 8, 2022
@arfeigin
Copy link
Contributor Author

arfeigin commented Sep 8, 2022

/azpw run Azure.sonic-utilities

@mssonicbld
Copy link
Collaborator

/AzurePipelines run Azure.sonic-utilities

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@qiluo-msft
Copy link
Contributor

@vaibhavhd Could you help review?

@@ -659,6 +631,15 @@ if [[ "$REBOOT_TYPE" = "fast-reboot" ]]; then
unload_kernel
exit "${EXIT_COUNTERPOLL_DELAY_FAILURE}"
fi

# Clear all routes except of default routes for faster reconciliation time.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is a delete call (cannot recover from here), shouldn't this step be executed after set +e ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! modified.

will update also in #2365 (PR for 202205)

vaibhavhd pushed a commit that referenced this pull request Sep 13, 2022
This PR is similar to #2286

This PR should be merged together with the sonic-sairedis PR (sonic-net/sonic-sairedis#1121) and sonic-buildimage PR (sonic-net/sonic-buildimage#12026).

Improve fast-reboot flow by using warm-reboot infrastructure.
Clear all routes except of default routes for faster reconciliation time.

Verified by community fast-reboot test, manual testing.
@qiluo-msft
Copy link
Contributor

Feel free to merge this PR. bandit checker is optional.

@vaibhavhd vaibhavhd merged commit 7099fff into sonic-net:master Sep 20, 2022
dprital added a commit to dprital/sonic-buildimage that referenced this pull request Sep 21, 2022
Update sonic-utilities submodule pointer to include the following:
* 0a7557b [minigraph] add option to specify golden path in load_minigraph ([sonic-net#2350](sonic-net/sonic-utilities#2350))
* 322aefc [GCU]Remove GCU unique lane check for duplicate lanes platforms ([sonic-net#2343](sonic-net/sonic-utilities#2343))
* 7099fff [fastboot] fastboot enhancement: Use warm-boot infrastructure for fast-boot ([sonic-net#2286](sonic-net/sonic-utilities#2286))
* 09026ed [warm-reboot] fix warm-reboot when /tmp/cache is missing ([sonic-net#2367](sonic-net/sonic-utilities#2367))
* a3c404c Fix typo in platform_sfputil_helper.is_rj45_port ([sonic-net#2374](sonic-net/sonic-utilities#2374))
* 637d834 Vnet_route_check Vxlan tunnel route update. ([sonic-net#2281](sonic-net/sonic-utilities#2281))
* 29a3e51 Added support for tunnel route status in show vnet routes all. ([sonic-net#2341](sonic-net/sonic-utilities#2341))
* 1ac584b Use 'default' VRF when VRF name is not provided ([sonic-net#2368](sonic-net/sonic-utilities#2368))
* 4d377a6 [subinterface]Added additional checks in portchannel and subinterface commands ([sonic-net#2345](sonic-net/sonic-utilities#2345))
* bbcdf2e disk_check: Publish event  for RO state ([sonic-net#2320](sonic-net/sonic-utilities#2320))
* 3fd537b Support the bandit check by GitHub Action ([sonic-net#2358](sonic-net/sonic-utilities#2358))
* 491d3d3 [generate dump]Added error message when saisdkdump fails ([sonic-net#2356](sonic-net/sonic-utilities#2356))
* 6830e01 [counterpoll]Fixing counterpoll show for tunnel and acl stats ([sonic-net#2355](sonic-net/sonic-utilities#2355))
* 3be2ad7 [fast-reboot]Avoid stopping masked services during fast-reboot ([sonic-net#2335](sonic-net/sonic-utilities#2335))
* 0e1b0cf [GCU] Fix missing backend in dry run ([sonic-net#2347](sonic-net/sonic-utilities#2347))
* 676c31b Add verification for override ([sonic-net#2305](sonic-net/sonic-utilities#2305))
* 48997c2 Add Password Hardening CLI support ([sonic-net#2338](sonic-net/sonic-utilities#2338))
* 414e239 update unit tests for swap ([#locato](https://github.com/sonic-net/sonic-utilities/pull/locato))
* a91a492 consider swap checking memory in ([#stalle](https://github.com/sonic-net/sonic-utilities/pull/stalle))
* f0ce586 [route_check]: Ignore standalone tunnel routes ([sonic-net#2325](sonic-net/sonic-utilities#2325))

Signed-off-by: dprital <drorp@nvidia.com>
vaibhavhd pushed a commit to sonic-net/sonic-buildimage that referenced this pull request Sep 26, 2022
This PR should be merged together with the sonic-utilities PR (sonic-net/sonic-utilities#2286) and sonic-sairedis PR (sonic-net/sonic-sairedis#1100).

Use redis contents from dump file in fast-reboot.

Improve fast-reboot flow by utilizing the warm-reboot infrastructure.
This followes https://github.com/sonic-net/SONiC/blob/master/doc/fast-reboot/Fast-reboot_Flow_Improvements_HLD.md
EdenGri pushed a commit to EdenGri/sonic-utilities that referenced this pull request Oct 12, 2022
…t-boot (sonic-net#2286)

This PR should be merged together with the sonic-sairedis PR (sonic-net/sonic-sairedis#1100) and sonic-buildimage PR (sonic-net/sonic-buildimage#11594).

This is done to improve fast-reboot flow by:
Using warm-reboot infrastructure.
Clear all routes except of default routes for faster reconciliation time.
qiluo-msft pushed a commit to sonic-net/sonic-swss-common that referenced this pull request Oct 18, 2022
…691)

**Depends on:**

- sonic-net/sonic-sairedis#1100
- sonic-net/sonic-utilities#2286
- sonic-net/sonic-buildimage#11594

**Why I did this?**

Daemons which are not related to warm/fast restart might affect the performance of warm/fast restart. A hardcoded start up delay is the current solution to avoid this.

This PR implements a function to wait warm/fast restart done. This function provided a efficiency and graceful way for daemons to wait warm/fast restart done.

**How I did it?**

Implement a utility function RestartWaiter::waitRestartDone. This function waits warm restart done flag in STATE DB and return true if the flag is set by warm restart finalizer. This function is also exposed as python extension so that python daemons can utilize it.

This PR depends on new fastboot design: https://github.com/sonic-net/SONiC/blob/master/doc/fast-reboot/Fast-reboot_Flow_Improvements_HLD.md
mdanish-kh pushed a commit to hamnarauf/sonic-utilities that referenced this pull request Oct 22, 2022
…t-boot (sonic-net#2286)

This PR should be merged together with the sonic-sairedis PR (sonic-net/sonic-sairedis#1100) and sonic-buildimage PR (sonic-net/sonic-buildimage#11594).

This is done to improve fast-reboot flow by:
Using warm-reboot infrastructure.
Clear all routes except of default routes for faster reconciliation time.
preetham-singh pushed a commit to preetham-singh/sonic-utilities that referenced this pull request Nov 21, 2022
…t-boot (sonic-net#2286)

This PR should be merged together with the sonic-sairedis PR (sonic-net/sonic-sairedis#1100) and sonic-buildimage PR (sonic-net/sonic-buildimage#11594).

This is done to improve fast-reboot flow by:
Using warm-reboot infrastructure.
Clear all routes except of default routes for faster reconciliation time.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

9 participants