
[RSDK-8187] config cache changes #262

Open · wants to merge 4 commits into base: main

Conversation

allisonschiang (Contributor)

Added storage into the config_monitor in order to write newly detected configs to NVS before rebooting, preventing a reboot cycle where the config is out of date.

However, this introduced a new issue where some invalid configs would cause a reboot loop, as they would crash the robot before entry.rs could detect a robot error and call reset_robot_configuration().

Some fixes could be:

  • Keep this approach and try to catch configs that would cause a reboot loop (don't know if config is valid until bot crashes)
  • Keep this approach and individually return errors to catch crashing invalid configs as we discover them
  • Add another check that pulls the config from app on boot-up and replaces the cache, so if the user switches to a valid config it overwrites the invalid one in the cache
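The store-before-restart flow described above can be sketched roughly as follows. Note this is an illustrative sketch, not the actual micro-RDK API: `Storage`, `MemStorage`, and the simplified `ConfigResponse` are stand-ins for the real `RobotConfigurationStorage` trait and proto types.

```rust
// Sketch of the store-before-restart flow. All names here are
// illustrative stand-ins, not the actual micro-RDK API.

#[derive(Clone, PartialEq, Debug)]
struct ConfigResponse {
    // Simplified stand-in for the proto config payload.
    config: Option<String>,
}

trait Storage {
    type Error: std::fmt::Debug;
    fn store_robot_configuration(&mut self, cfg: String) -> Result<(), Self::Error>;
}

struct MemStorage {
    cached: Option<String>,
}

impl Storage for MemStorage {
    type Error = ();
    fn store_robot_configuration(&mut self, cfg: String) -> Result<(), Self::Error> {
        self.cached = Some(cfg);
        Ok(())
    }
}

// On a config change: persist the new config first, then restart, so the
// next boot starts from the fresh cache instead of the stale one.
fn on_config_change<S: Storage>(
    storage: &mut S,
    curr: &ConfigResponse,
    new: ConfigResponse,
    restart: impl FnOnce(),
) {
    if *curr != new {
        if let Some(cfg) = new.config {
            if let Err(e) = storage.store_robot_configuration(cfg) {
                eprintln!("failed to cache new config: {:?}", e);
            }
        }
        restart();
    }
}
```

The key ordering is that the cache write happens before the restart hook fires; the invalid-config reboot loop described above is exactly the failure mode this ordering does not yet guard against.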

@allisonschiang allisonschiang requested a review from a team as a code owner July 15, 2024 14:51
ServerError: From<<S as RobotConfigurationStorage>::Error>,
<S as WifiCredentialStorage>::Error: Sync + Send + 'static,
{
pub fn new(restart_hook: impl FnOnce() + 'a, curr_config: ConfigResponse, storage: S) -> Self {
Member

Let's make the lambda come last.

false => {
if let Err(e) = self
.storage
.store_robot_configuration(new_config.config.unwrap())
Member

I'm suspicious of unwrap. What should happen here if new_config.config is None?

allisonschiang (Contributor Author), Jul 15, 2024

Not sure; what happens in entry.rs when get_config is called and new_config is None or an AppClientError is returned?

let (app_config, cfg_response, cfg_received_datetime) =
app_client.get_config(Some(network.get_ip())).await.unwrap();

Contributor Author

I'm guessing it panics?

Contributor Author

oh I see, new_config is already unwrapped
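For reference, `Option::unwrap` on `None` panics, which on-device means a reboot mid-task; this is the concern behind the question above. A toy illustration (not the PR's code) of turning the absent case into a recoverable error instead:

```rust
// Toy illustration of the unwrap concern discussed above; not the PR's code.
// `cfg.unwrap()` would panic on None and reboot the robot mid-task; matching
// on the Option turns the absent case into a recoverable error instead.
fn cache_config(cfg: Option<&str>) -> Result<&str, &'static str> {
    match cfg {
        Some(c) => Ok(c),
        None => Err("no config in response; skipping cache write"),
    }
}
```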

@allisonschiang allisonschiang self-assigned this Jul 15, 2024
Comment on lines +15 to +18
S: RobotConfigurationStorage + WifiCredentialStorage + Clone + 'static,
<S as RobotConfigurationStorage>::Error: Debug,
ServerError: From<<S as RobotConfigurationStorage>::Error>,
<S as WifiCredentialStorage>::Error: Sync + Send + 'static,
Member

This won't work unless feature = "provisioning" is enabled, but see #267 which offers a way out. Perhaps this change should wait on that one.

micro-rdk/src/common/config_monitor.rs (resolved)
@acmorrow (Member)

What about the following idea: when building from a cached config, we first fetch the config, and then we immediately erase it. Then we build the robot. Only if the robot builds OK do we write it back to the cache again. I don't love it because of the extra flash wear it implies, but that's already a problem we have.
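That erase-build-rewrite sequence might be sketched like this (illustrative types and names, not the real entry.rs API):

```rust
// Sketch of the erase-build-rewrite idea: erase the cached config before
// building, write it back only on success. Illustrative types only.

#[derive(Clone, Debug, PartialEq)]
struct RobotConfig(String);

#[derive(Debug)]
struct BuildError;

// Take the cached config, erase it immediately, then try to build. A panic
// mid-build leaves the cache empty, so the next boot cannot re-enter the
// crash loop; the config is written back only after a successful build.
fn build_from_cache(
    cache: &mut Option<RobotConfig>,
    build: impl Fn(&RobotConfig) -> Result<(), BuildError>,
) -> Result<(), BuildError> {
    let cfg = match cache.take() {
        Some(c) => c, // cache is now erased
        None => return Err(BuildError),
    };
    build(&cfg)?; // on failure (or panic) the cache stays empty
    *cache = Some(cfg); // write back only on success
    Ok(())
}
```

The flash-wear cost comes from the erase-plus-rewrite happening on every successful boot, not just on config changes.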

@acmorrow (Member)

I discussed this idea with @npmenard and he feels like the flash wear is too much of a problem. Please move forward with committing this (once you have addressed the failure to build without provisioning), and file a new bug ticket about the situation that arises when a panic inducing config is cached. Then we can tackle that in a new PR.

@mattjperez (Member) left a comment

This looks good to me (nice work on the generics in the signatures), but definitely:

  1. wait for [RSDK-8219] allow micro-RDK to build w/out provisioning feature #267 to merge, then retest compilation with and without provisioning
  2. make a follow-up ticket and reference it in a TODO comment

Comment on lines +72 to +77
if let Some(config) = new_config.config {
if let Err(e) = self.storage.store_robot_configuration(config) {
log::warn!("Failed to store new robot configuration from app: {}", e);
}
}
self.restart();
Member

Please add the crash-loop bug as a TODO comment here once it's filed.

@@ -43,7 +69,12 @@ impl<'a> PeriodicAppClientTask for ConfigMonitor<'a> {
app_client.clone().get_config(None).await
{
if self.curr_config != *new_config {
self.restart()
if let Some(config) = new_config.config {
if let Err(e) = self.storage.store_robot_configuration(config) {
Member

did we consider erasing the config then restarting?

Member

We could. That would leave open the possibility that if you can't connect when you come back online, you won't be able to build a robot until you talk to app again. That might be a risk worth taking.
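A minimal sketch of that alternative, erasing the cached config before restarting, might look like this (hypothetical trait and struct names, not the real micro-RDK types):

```rust
// Minimal sketch of the erase-then-restart alternative discussed above;
// trait and struct names are hypothetical.

trait ConfigCache {
    fn reset_robot_configuration(&mut self);
}

struct Nvs {
    cached: Option<String>,
}

impl ConfigCache for Nvs {
    fn reset_robot_configuration(&mut self) {
        self.cached = None;
    }
}

// On detecting a changed config, wipe the cache and restart. The next boot
// must fetch a fresh config from app before it can build a robot, trading
// offline resilience for crash-loop safety.
fn on_config_change_erase<C: ConfigCache>(cache: &mut C, restart: impl FnOnce()) {
    cache.reset_robot_configuration();
    restart();
}
```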

Contributor Author

I think this could work; if you can't come back online it's either an empty config or an invalid config, and neither would work anyway. Should I implement this solution?
