Robot Settings Error Handling #184

trickeydan · 2022-11-19T14:28:28Z

At the moment if a robot-settings.toml file is invalid, we silently regenerate it without any notice to the user.

This PR adds a robot-settings-error.txt that is generated when an error occurs. This file contains a message to tell the user that their settings has been regenerated, a detailed error for why the settings were invalid and in most cases a copy of the old settings file (as long as the file is valid unicode).

An error file is not generated if the file is missing, as this is the expected behaviour for a new USB drive.

This PR also adds better validation of the WiFi PSK.

Example error file:

There was an error loading your robot-settings.toml
Your robot-settings.toml has been overwritten.

robot-settings.toml did not match schema: 1 validation error for ParsingModel[RobotSettings]
__root__ -> team_tla
  Team name did not match format: ABC, ABC1 etc. (type=value_error)

Invalid settings file:

team_tla = "MYROBOT"
usercode_entrypoint = "robot.py"
wifi_psk = "64ea-33fc-c10b"
wifi_region = "GB"
wifi_enabled = true

Also fixes #175

Fixes #175

PeterJCLaw · 2022-11-19T18:10:46Z

How important is is that the system carry on in the face of errors? Would rejecting the file (or invalid settings) and then stopping be a valid option?
For example: is there an error LED somewhere we can use to indicate failure?

The current approach here does improve things, as there's now some diagnostic information, however it still feels relatively easy to miss that an error file has been created. Additionally, it looks like the error file will persist even when the settings read was successful which could complicate things on subsequent runs.

It may be worth thinking through what sorts of scenarios can cause errors here. For example: could it be that someone has customised their WiFi config, had it working, then tweaked some other aspect of the config and now finds they can't connect to the WiFi due to the settings having all been regenerated? That sort of thing feels like it could end up in blind alleys around "why has my WiFi stopped working (but the robot still ran)" which miss the fact that the WiFi is actually working just with different credentials.

Having the old config output as part of the error file feels like a good aspect here, however I'm not really convinced that the "regenerate and replace" approach is a good one to start with.
I would encourage only modifying what's on the USB stick if the file (or perhaps individual settings) are completely missing and preferring not to modify something the user has provided (even if what they've provided is garbage). Following from that I'd encourage gracefully degrading or loudly complaining about the errors (or both) rather than quietly making things work implicitly.
While perhaps less convenient I would expect doing so to lead to much more "obvious" behaviours both for competitors and implementers.

(Am I missing something which makes that sort of approach unfeasible?)

trickeydan · 2022-11-19T20:21:53Z

Thanks for the good questions, much appreciated :)

How important is is that the system carry on in the face of errors? Would rejecting the file (or invalid settings) and then stopping be a valid option? For example: is there an error LED somewhere we can use to indicate failure?

We don't have a separate LED we could use to indicate to failure. We could use the OK LED on the KCH to indicate this, perhaps a reuse of the red colour.

The current approach here does improve things, as there's now some diagnostic information, however it still feels relatively easy to miss that an error file has been created. Additionally, it looks like the error file will persist even when the settings read was successful which could complicate things on subsequent runs.

Excellent point. Would you suggest deleting the error file (if it exists) where the settings file has been successfully parsed?

It may be worth thinking through what sorts of scenarios can cause errors here. For example: could it be that someone has customised their WiFi config, had it working, then tweaked some other aspect of the config and now finds they can't connect to the WiFi due to the settings having all been regenerated? That sort of thing feels like it could end up in blind alleys around "why has my WiFi stopped working (but the robot still ran)" which miss the fact that the WiFi is actually working just with different credentials.

There is an indicator on the robot to show that the WiFi is working. I'm not sure that's applicable to all potential settings though.

Having the old config output as part of the error file feels like a good aspect here, however I'm not really convinced that the "regenerate and replace" approach is a good one to start with. I would encourage only modifying what's on the USB stick if the file (or perhaps individual settings) are completely missing and preferring not to modify something the user has provided (even if what they've provided is garbage). Following from that I'd encourage gracefully degrading or loudly complaining about the errors (or both) rather than quietly making things work implicitly. While perhaps less convenient I would expect doing so to lead to much more "obvious" behaviours both for competitors and implementers.

(Am I missing something which makes that sort of approach unfeasible?)

From an implementation perspective, this is potentially difficult as the settings file is parsed in a different process from where the code is executed. We would have to think carefully how to avoid the code starting when the settings file is invalid, the current service architecture just doesn't work well for that scnenario.

I'd also be concerned about a robot failing to run in a competition match because the settings file became corrupted.

Regenerating the file is the current behaviour for an error, but just completely silently.

I think in the future there is an aim to have the settings editable from the web ui instead anyway, which would mitigate most of these issues. That is very much not in scope until SR2024 though.

PeterJCLaw · 2022-11-19T20:43:20Z

The current approach here does improve things, as there's now some diagnostic information, however it still feels relatively easy to miss that an error file has been created. Additionally, it looks like the error file will persist even when the settings read was successful which could complicate things on subsequent runs.

Excellent point. Would you suggest deleting the error file (if it exists) where the settings file has been successfully parsed?

I wondered about this, however it feels like this could lead to undesirable things too. One example is that running the robot twice after breaking the settings file leaves no trace of the error and completely erases the original (broken) settings file.

From an implementation perspective, this is potentially difficult as the settings file is parsed in a different process from where the code is executed. We would have to think carefully how to avoid the code starting when the settings file is invalid, the current service architecture just doesn't work well for that scnenario.

Given the presence of the usercode entry point in the settings file I assume there's already a strict ordering to things though? What would happen if we just never provided the usercode entry point from the service that's parsing the settings? (It feels like that should have a similar effect -- the code never runs)

For clarity: I also think it would be reasonable to only degrade the parts of the settings which aren't valid -- for example if the WiFi details aren't valid but the usercode entry point is, then running the code but disabling the WiFi might be a better solution than fully regenerating the file.

I'd also be concerned about a robot failing to run in a competition match because the settings file became corrupted.

Good point.

As the settings file includes the usercode entry point this already seems possible with the regeneration route?
Bear in mind that the regeneration path could also lead to running the wrong code as an alternative outcome to not running any code. This (wrong code) feels like it would be quite hard to diagnose and competitors would probably feel it was our fault if it happened.

I hadn't spotted this (wrong code) scenario before, though this is the sort of thing I mean by the behaviour not being "obvious". It also feels a pretty compelling reason to me to avoid the regeneration route and focus on improving the feedback to the user that something is wrong.

trickeydan · 2022-11-20T17:24:56Z

The current approach here does improve things, as there's now some diagnostic information, however it still feels relatively easy to miss that an error file has been created. Additionally, it looks like the error file will persist even when the settings read was successful which could complicate things on subsequent runs.

Excellent point. Would you suggest deleting the error file (if it exists) where the settings file has been successfully parsed?

I wondered about this, however it feels like this could lead to undesirable things too. One example is that running the robot twice after breaking the settings file leaves no trace of the error and completely erases the original (broken) settings file.

Good point. I think we should continue to leave it there, in the same manner as log.txt

From an implementation perspective, this is potentially difficult as the settings file is parsed in a different process from where the code is executed. We would have to think carefully how to avoid the code starting when the settings file is invalid, the current service architecture just doesn't work well for that scnenario.

Given the presence of the usercode entry point in the settings file I assume there's already a strict ordering to things though? What would happen if we just never provided the usercode entry point from the service that's parsing the settings? (It feels like that should have a similar effect -- the code never runs)

Oh I wish. The settings file is actually read twice, both in astmetad (this PR) and astprocd. I've created #186 to document this tech debt. We should definitely sort this out and introduce a strict ordering, although I'm very uneasy about doing that mid-cycle.

@PeterJCLaw: Would you be okay with us leaving this as is for this year, documenting the changes we need and then introducing strict ordering for next competition year?

As the settings file includes the usercode entry point this already seems possible with the regeneration route? Bear in mind that the regeneration path could also lead to running the wrong code as an alternative outcome to not running any code. This (wrong code) feels like it would be quite hard to diagnose and competitors would probably feel it was our fault if it happened.

I hadn't spotted this (wrong code) scenario before, though this is the sort of thing I mean by the behaviour not being "obvious". It also feels a pretty compelling reason to me to avoid the regeneration route and focus on improving the feedback to the user that something is wrong.

I'm somewhat regretting adding the usercode_entrypoint setting now to be honest. I can think of three choices here:

Leaving it as is. It wasn't an issue last year, although we also didn't tell them about the settings file.
Attempting to mitigate the diagnosis issues. My best idea is to add a line at the top of the log file stating what the entrypoint was.
Removing the usercode_entrypoint setting outright. I think we could get away with this mid-year.

@raccube I'd be interested to hear your thoughts here please

PeterJCLaw · 2022-11-20T18:54:33Z

Is there an option to remove the regeneration of the file in cases where it already exists? If it's read twice then it sounds like the regeneration already has race conditions associated with it.
Following from that updated behaviour, disabling features that depended on unavailable settings seems reasonable to me. Having this error file would then help identify issues. If we're able to also have an LED (or something) show an error at boot in such cases that's a bonus.

@PeterJCLaw: Would you be okay with us leaving this as is for this year, documenting the changes we need and then introducing strict ordering for next competition year?

I'm not completely clear on what you mean by the "this" here. If you mean the case of running the user code even when the WiFi settings are broken, I think that's fine (perhaps even in the long term as long) as the feedback to the user is clear.
If you mean something else could you clarify?

trickeydan · 2022-11-22T18:00:49Z

Is there an option to remove the regeneration of the file in cases where it already exists?

Not feasible without a re-architecture to handle the robot settings under a separate service. (i.e read in one place*)

*: Technically it will still need to be read in two places, but one read is strictly ahead of the other as it takes place when disk type is determined.

If it's read twice then it sounds like the regeneration already has race conditions associated with it.

Yes. There is a dodgy sleep statement that handles this at the moment. The above separate service would probably resolve these issues too.

Is there an option to remove the regeneration of the file in cases where it already exists? If it's read twice then it sounds like the regeneration already has race conditions associated with it. Following from that updated behaviour, disabling features that depended on unavailable settings seems reasonable to me. Having this error file would then help identify issues. If we're able to also have an LED (or something) show an error at boot in such cases that's a bonus.

I agree that this would be desirable. I don't think it's feasible to make the changes required mid-competition.

I'm not completely clear on what you mean by the "this" here. If you mean the case of running the user code even when the WiFi settings are broken, I think that's fine (perhaps even in the long term as long) as the feedback to the user is clear. If you mean something else could you clarify?

More so that this is strictly an improvement over the current behaviour, even if the behaviour is still very much imperfect.

I think we can cover the entrypoint mess by adding a note in the docs. I'll add a PR shortly.

Issue #187 created so that we do not forget about this.

When a robot-settings.toml file is bad, generate a robot-settings-error.txt file on the USB drive also. This file contains a message to tell the user that their settings has been regenerated, a detailed error for why the settings were invalid and in most cases a copy of the old settings file. An error file is not generated if the file is missing, as this is the expected behaviour for a new USB drive.

Whilst we don't allow non-ascii chars to be specified by the user, technically we need to encode this to check it

PeterJCLaw · 2022-11-23T18:42:37Z

For my understanding/posterity -- could you expand on what the failure mode is if we didn't regenerate the file? i.e: just left the "bad" one in place?

trickeydan · 2022-11-23T18:52:09Z

For my understanding/posterity -- could you expand on what the failure mode is if we didn't regenerate the file? i.e: just left the "bad" one in place?

With our current architecture, exceptions would be thrown in all Astoria daemons and no metadata would be processed. This would result in code never running because the mqtt messages get thrown away. It's a really bad idea.

It's not even really possible either. Pydantic is used to parse and validate the file and then that struct is used everywhere. There can't really be invalid data.

We will fix this, just not now.

PeterJCLaw · 2022-11-23T18:56:46Z

Ah, ok. I had thought I'd seen some checks for things like the WiFi name being the right shape (having a TLA) -- i.e: validation of the content of the data rather than parsing errors. Does that case behave differently?

trickeydan added 2 commits November 19, 2022 14:08

astmetad: Enforce strict parsing on user settings

6886f24

astmetad: Improve validation of user settings

5a9f821

Fixes #175

trickeydan added enhancement New feature or request A | astmetad This relates to astmetad labels Nov 19, 2022

trickeydan requested review from raccube and WillB97 November 19, 2022 14:28

WillB97 approved these changes Nov 19, 2022

View reviewed changes

trickeydan force-pushed the dgt/robot-settings-error-handling branch from a2b7099 to bffdfd5 Compare November 22, 2022 17:51

trickeydan mentioned this pull request Nov 22, 2022

Consider handling user settings as a separate service. #187

Open

trickeydan added 3 commits November 22, 2022 18:13

config/user: Encode SSID as UTF-8 before measuring length

d3e79be

Whilst we don't allow non-ascii chars to be specified by the user, technically we need to encode this to check it

tests: Add tests for user settings validation

7a19368

trickeydan force-pushed the dgt/robot-settings-error-handling branch from bffdfd5 to 7a19368 Compare November 22, 2022 18:14

raccube approved these changes Nov 22, 2022

View reviewed changes

trickeydan enabled auto-merge November 22, 2022 18:15

trickeydan merged commit 5e1adfc into main Nov 22, 2022

trickeydan deleted the dgt/robot-settings-error-handling branch November 22, 2022 18:15

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Robot Settings Error Handling #184

Robot Settings Error Handling #184

trickeydan commented Nov 19, 2022

PeterJCLaw commented Nov 19, 2022

trickeydan commented Nov 19, 2022

PeterJCLaw commented Nov 19, 2022

trickeydan commented Nov 20, 2022

PeterJCLaw commented Nov 20, 2022

trickeydan commented Nov 22, 2022

PeterJCLaw commented Nov 23, 2022

trickeydan commented Nov 23, 2022

PeterJCLaw commented Nov 23, 2022

Robot Settings Error Handling #184

Robot Settings Error Handling #184

Conversation

trickeydan commented Nov 19, 2022

PeterJCLaw commented Nov 19, 2022

trickeydan commented Nov 19, 2022

PeterJCLaw commented Nov 19, 2022

trickeydan commented Nov 20, 2022

PeterJCLaw commented Nov 20, 2022

trickeydan commented Nov 22, 2022

PeterJCLaw commented Nov 23, 2022

trickeydan commented Nov 23, 2022

PeterJCLaw commented Nov 23, 2022