Properties revert back to default value upon restart, upgrade related #15674

Closed
daisukebe opened this issue Dec 15, 2023 · 4 comments · Fixed by #15761
Comments

@daisukebe
Contributor

daisukebe commented Dec 15, 2023

Version & Environment

Redpanda version: v23.2.16

In a cluster upgraded from v23.1.x, a property value unexpectedly reverts to its original value.

Properties confirmed to be affected:

  • space_management_enable
  • retention_local_strict

What went wrong?

See the repro steps below.

What should have happened instead?

The saved value should not change across a restart.

How to reproduce the issue?

  1. Set up a 23.1.x cluster
  2. Upgrade it to 23.2.x
  3. Confirm space_management_enable is false
  4. Change it to true via rpk: rpk cluster config set space_management_enable true
    a. Confirm the value is persisted: rpk cluster config get space_management_enable returns true
  5. Restart the broker
  6. rpk cluster config get space_management_enable returns false

Additional information

https://redpandadata.slack.com/archives/C06AT54L2M6

@daisukebe daisukebe added the kind/bug Something isn't working label Dec 15, 2023
@oleiman
Member

oleiman commented Dec 15, 2023

A few thoughts while it’s fresh:

When we apply our feature table snapshot on startup, we first set_active_version to snapshot.version, which itself calls set_original_version with that version and tries the legacy default path. A bit later, we call set_original_version again, this time with snapshot.original_version, and traverse the legacy default path again, leaving the feature table's original version at the pre-upgrade logical version:

// src/v/features/feature_table_snapshot.cc
void feature_table_snapshot::apply(feature_table& ft) const {
...
    ft.set_active_version(version);
...
    ft.set_applied_offset(applied_offset);
    ft.set_original_version(original_version);

    ft.on_update();
}

Maybe this is why the “old” original_version persists across a restart of the new version, forcing us back to the legacy default? Or is there some later update to feature_table::_original_version that’s been lost somehow?
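To make that sequence concrete, here is a minimal, self-contained sketch (not the real feature_table; the version numbers and method bodies are invented for illustration) of how apply() can end up evaluating the legacy default path twice and leaving the original version at the pre-upgrade value:

#include <cstdint>
#include <iostream>

// Illustrative stand-in for the real feature_table; member and method
// names follow the snippet above, but the bodies are simplified.
struct feature_table_sketch {
    int64_t _active_version{0};
    int64_t _original_version{0};

    // set_active_version also records the original version, so the
    // legacy default path is evaluated once here...
    void set_active_version(int64_t v) {
        _active_version = v;
        set_original_version(v);
    }

    // ...and once more when apply() later passes snapshot.original_version,
    // leaving _original_version at the pre-upgrade logical version.
    void set_original_version(int64_t ov) {
        _original_version = ov;
        std::cout << "legacy default path evaluated with ov=" << ov << "\n";
    }
};

int main() {
    feature_table_sketch ft;
    // Same call order as feature_table_snapshot::apply():
    ft.set_active_version(10);  // snapshot.version (post-upgrade)
    ft.set_original_version(9); // snapshot.original_version (pre-upgrade)
    std::cout << "original_version after apply: " << ft._original_version << "\n";
}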

Also, this condition is suspiciously broad:

    if (ov <= _legacy_default.value().max_original_version && is_default()) {

Here is_default() just checks whether the property’s current value is equal to the default; there’s no distinction between a property that merely took its default value and one that was explicitly set to it. That’s presumably why @dotnwat's legacy default test didn’t display this behavior: it sets the property to default + 1 before restarting the node.
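A small sketch of why that matters (this is not the real property implementation, and the specific default values are hypothetical): because is_default() only compares the current value with the default, a value explicitly set to the new default is indistinguishable from one that was never set, so the legacy-default rewrite clobbers it:

#include <cstdint>
#include <iostream>

// Illustrative property sketch; it only tracks the current value, not
// whether that value was explicitly set by an operator.
struct property_sketch {
    bool default_value{true};   // post-upgrade default (hypothetical)
    bool legacy_default{false}; // pre-upgrade default (hypothetical)
    bool value{true};

    bool is_default() const { return value == default_value; }

    // Simplified version of the suspicious condition: if the cluster's
    // original version is old enough and the value "looks" default,
    // rewrite it to the legacy default.
    void maybe_apply_legacy_default(int64_t ov, int64_t max_original_version) {
        if (ov <= max_original_version && is_default()) {
            value = legacy_default;
        }
    }
};

int main() {
    property_sketch p;
    p.value = true; // operator explicitly set the property to true
    p.maybe_apply_legacy_default(/*ov=*/9, /*max_original_version=*/9);
    std::cout << "value after restart: " << p.value << "\n"; // 0: reverted
}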

None of that's conclusive (and @BenPope stated most or all of it already), but this is about as much as I have my head around right now.

@oleiman oleiman self-assigned this Dec 15, 2023
@andijcr
Contributor

andijcr commented Dec 15, 2023

I think I observed something similar some months ago, but I couldn't follow up and it fell through the cracks.

#13362

In my case it has to do with property aliases, but the effect is similar: a property resetting to a previous value. I captured this log, which shows the alias name and the canonical name being reapplied in succession:

TRACE 2023-09-08 09:15:47,340 [shard 0:main] cluster - config_manager.cc:313 - Loaded property cloud_storage_graceful_transfer_timeout=1235 from local cache
TRACE 2023-09-08 09:15:47,342 [shard 0:main] cluster - config_manager.cc:313 - Loaded property cloud_storage_graceful_transfer_timeout_ms=1234 from local cache

So fixing this should fix the same issue with aliases. Hopefully what I wrote makes sense.
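For illustration, here is a minimal sketch of the alias effect (this is not the config_manager code; the cache layout and values are invented to mirror the log above): when the alias and the canonical name both live in the local cache and map to one underlying property, whichever entry is applied last silently wins:

#include <iostream>
#include <map>
#include <string>

int main() {
    // Hypothetical local-cache contents modeled on the log above: the alias
    // and the canonical name are both present, with different values.
    std::map<std::string, int> local_cache{
        {"cloud_storage_graceful_transfer_timeout", 1235},
        {"cloud_storage_graceful_transfer_timeout_ms", 1234},
    };

    // Both names resolve to the same underlying property, so whichever
    // entry is applied last wins; iteration order, not the recency of the
    // operator's change, decides the final value.
    int graceful_transfer_timeout_ms = 0;
    for (const auto& [name, value] : local_cache) {
        std::cout << "Loaded property " << name << "=" << value << "\n";
        graceful_transfer_timeout_ms = value; // apply to the shared property
    }
    std::cout << "final value: " << graceful_transfer_timeout_ms << "\n"; // 1234
}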

@BenPope
Member

BenPope commented Dec 16, 2023

> I think I observed something similar some months ago, but I couldn't follow up and it fell through the cracks.

Test cases welcome!

@andijcr
Contributor

andijcr commented Dec 16, 2023

> > I think I observed something similar some months ago, but I couldn't follow up and it fell through the cracks.
>
> Test cases welcome!

The test in #15263, run against #15605 minus its last commit, shows the effect. That last commit is an attempt to work around the underlying issue.
