fix(rolling-upgrade): Set of commits to fix rolling upgrade from 6.0-6.1 #7525

aleksbykov · 2024-06-02T05:30:35Z

A set of commits that fix several issues in rolling upgrades from 6.0 to 6.1.

fix(restore-backup): Restore only existent files - Fixes a script error when trying to restore backup config files that could be missing.

fix(fill_db-UP): Temporarily disable cdc for rolling upgrade - Disable creating tables with cdc options, because cdc is not supported with tablets.

fix(upgrade_test): Wait while all nodes up after rollback - After a node upgrade/rollback we need to wait until all nodes are up and normal. This is required by the raft topology feature.

fix(get_highest_sstable_version): get enabled sstable version from db - The log message with the enabled sstable format has disappeared from the log file. Added new methods to get the supported sstable format from system_local.

fix(upgrade): no assert on upgrade sstable - Change the assert to an error log for the current sstable format.

The latest two commits could be related to scylladb/scylladb#18995.

Testing

Job with fixes

PR pre-checks (self review)

I added the relevant backport labels
I didn't leave commented-out/debugging code

Reminders

Add New configuration option and document them (in sdcm/sct_config.py)
Add unit tests to cover my changes (under unit-test/ folder)
Update the Readme/doc folder relevant to this change (if needed)

aleksbykov · 2024-06-02T05:31:06Z

Not sure whether we need a backport to 6.0.

fruch · 2024-06-02T06:16:26Z

Not sure, do we need backport to 6.0

Take a look at the 6.0 upgrade tests, and you'll know if it's needed or not.

It couldn't hurt, could it?

fruch

@aleksbykov

The job you mentioned with the fix didn't reach that part at all.

soyacz · 2024-06-07T07:04:36Z

@aleksbykov please mark it as a draft - there are many commented out parts which look like testing only.

fruch · 2024-06-18T05:01:31Z

@aleksbykov it would be helpful if you communicated a bit about the situation; just marking people as reviewers doesn't mean it will get attention.

I wasn't even aware this PR is fixing issues that I assumed the tablets team should be fixing (and they didn't either).

test-cases/upgrades/rolling-upgrade.yaml

aleksbykov · 2024-06-19T11:52:36Z

@aleksbykov it would be helpful if you communicated a bit about the situation; just marking people as reviewers doesn't mean it will get attention.

I wasn't even aware this PR is fixing issues that I assumed the tablets team should be fixing (and they didn't either).

I agree, it's my fault that I didn't notice. I was just waiting for the fix of scylladb/scylladb#18995, which should have been fast, so that I could remove the unneeded commit, but it looks like it is delayed.
The set of commits fixes regular rolling upgrades by disabling cdc.
Could @scylladb/qa-maintainers review the pr?

sdcm/fill_db_data.py

yarongilor

LGTM

upgrade_test.py

fruch

A few suggestions/comments.

Mainly, one should consider the mix of disabling raft topology and tablets, and make sure those tests would still be operational.

aleksbykov · 2024-06-25T06:37:56Z

@fruch, can you take a look?
I added a new property to check the status of tablets and changed the gemini part of the rolling_upgrade tests a bit. Now, if tablets are enabled, gemini is started without cdc; if tablets are disabled, cdc options are added to the gemini command.
Jobs passed:

I decided not to add a new sct parameter to control enabling/disabling cdc with tablets, and left the version_cdc_support method unchanged.
I think this method could safely be removed entirely in the next PR, because we don't have a version without cdc support.
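For illustration, the gemini behaviour described above could be sketched roughly as follows. This is a minimal sketch, not the code in this PR: `build_gemini_cmd` and its `base_cmd` argument are hypothetical names, and only the `--table-options` string is taken from the diff under review.

```python
# Table option string as it appears in the reviewed diff.
CDC_TABLE_OPTIONS = " --table-options \"cdc={'enabled': true}\""


def build_gemini_cmd(base_cmd: str, tablets_enabled: bool) -> str:
    """Return the gemini command, adding cdc options only when tablets are off.

    Hypothetical helper: in the PR the decision is made inline in the test,
    based on a tablets_enabled property.
    """
    if tablets_enabled:
        # cdc cannot be enabled together with tablets, so run gemini without it.
        return base_cmd
    return base_cmd + CDC_TABLE_OPTIONS
```

With this shape gemini always runs during the rolling upgrade, and only the cdc table options are conditional.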

upgrade_test.py

fruch · 2024-06-25T08:37:00Z

sdcm/fill_db_data.py

@@ -3179,6 +3185,12 @@ def version_cdc_support(self):
            version_with_support = self.CDC_SUPPORT_MIN_VERSION
        return self.parsed_scylla_version >= version_with_support

+    @optional_cached_property


Why optional_cached_property? Can it return None?

I set it as optional for future purposes, but you're right. I've set it as cached_property.

fruch · 2024-06-25T08:38:26Z

upgrade_test.py

+        if not output:
+            for node in self.db_cluster.nodes:
+                with self.db_cluster.cql_connection_patient_exclusive(node) as session:
+                    output.extend(get_node_enabled_sstable_version(session))


I don't understand this code.

Why not replace this one with the previous?

I renamed the variable to be more meaningful. Also, I didn't remove the old code, which uses the logs, because the master branch is also used with different versions, where the old functionality is used.
So if the sstable format was not parsed from the logs (as in old versions), it will be read from the scylla table for new versions.

Reading it from the logs doesn't work for new versions?
Reading it from system tables doesn't work for old versions?

I'm a bit lost. What exactly happened that we need to touch any of those parts?

If we do, I would want to know why, and make sure we don't keep duplication for no real reason.

upgrade_test.py

fruch · 2024-06-25T08:42:02Z

upgrade_test.py

-            gemini_thread = self.run_gemini(self.params.get("gemini_cmd"))
+            # TODO: workaround for issue #16317
+            if self.version_cdc_support() and not self.tablets_enabled:
+                gemini_cmd += " --table-options \"cdc={'enabled': true}\""


I don't understand what's the point of running gemini without cdc?

If the whole purpose it was introduced for was testing cdc...

It's not the whole point of running Gemini.
Let's not remove it.

I changed the code a bit but kept the main idea: run gemini during rolling upgrade both with and without cdc support.

Add new properties to the fill_db_data class that check whether it is possible to run cdc with tablets. Currently there is issue scylladb/scylladb#16317: cdc can't be enabled with tablets. If tablets are enabled and the issue is open, cdc options will not be added to tables and gemini won't be run. If tablets are disabled or the issue is fixed, cdc options will be added to tables and the gemini command with cdc will be run.

Wait after upgrade/rollback until all nodes are up and normal. Raft requires all nodes to be up and normal before any topology operation.
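The wait described here can be pictured as a simple polling loop. The sketch below is only an illustration under stated assumptions: `wait_all_nodes_up_normal` and the `get_statuses` callable are hypothetical stand-ins for whatever the test framework uses to read node state, and "UN" is the Up/Normal code shown by `nodetool status`.

```python
import time
from typing import Callable, Dict


def wait_all_nodes_up_normal(get_statuses: Callable[[], Dict[str, str]],
                             timeout: float = 300.0,
                             interval: float = 5.0) -> None:
    """Poll until every node reports 'UN' (Up/Normal), else raise TimeoutError.

    get_statuses is a hypothetical callable returning {node_name: status}.
    """
    deadline = time.monotonic() + timeout
    while True:
        statuses = get_statuses()
        not_ready = {name: state for name, state in statuses.items() if state != "UN"}
        if statuses and not not_ready:
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(f"nodes not up and normal: {not_ready}")
        time.sleep(interval)
```

Calling such a helper after each node upgrade or rollback, before the next topology operation, is the ordering the commit message asks for.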

Get the enabled sstable format from the scylladb system table, because the log file doesn't contain the appropriate message starting from 6.0.
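The two-source lookup (logs first, then the database) can be sketched as below. This is a simplified illustration, not the code in the PR: the two callables are hypothetical placeholders for the log parser and the system-table query, and only the final `max(set(...))` selection mirrors the reviewed diff.

```python
from typing import Callable, Iterable, List


def highest_enabled_sstable_format(
    formats_from_logs: Callable[[], Iterable[str]],
    formats_from_system_table: Callable[[], Iterable[str]],
) -> str:
    """Return the highest sstable format, preferring log output when present."""
    enabled: List[str] = list(formats_from_logs())
    if not enabled:
        # Newer versions no longer log the format, so ask the database instead.
        enabled = list(formats_from_system_table())
    if not enabled:
        raise ValueError("no enabled sstable format found")
    # Format names such as 'mc', 'md', 'me' sort lexicographically.
    return max(set(enabled))
```

Keeping both sources means the same helper works on versions that still log the format and on versions that only expose it in the system table, at the cost of the duplication raised in the review.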

aleksbykov · 2024-06-25T15:25:15Z

New:

removed the unrelated commit after the issue was fixed: Scylla is not using ME sstables by default scylladb#18995
removed the old method and vars for version_cdc_support
set properties as cached_property
added a new property to control enabling cdc options for tables based on the tablets feature state and SkipPerIssue
renamed var 'output' -> 'enabled_sstable_format_features' in method 'get_highest_supported_sstable_version' and kept support for both behaviors: get formats from logs and, if not found, from the scylla table

aleksbykov · 2024-06-26T07:03:19Z

@fruch can you take a look?

fruch · 2024-06-30T08:43:56Z

upgrade_test.py

-        return max(set(output))
+            enabled_sstable_format_features.extend(get_node_supported_sstable_versions(node.system_log))
+
+        if not enabled_sstable_format_features:


I still don't understand this fallback.. and why it's here...

fruch

LGTM

fruch · 2024-06-30T08:47:54Z

@aleksbykov I would expect a follow-up on the sstable upgrade part, to clean it up and clarify it.

roydahan · 2024-08-11T11:51:34Z

upgrade_test.py

@fruch / @syuu1228 IIUC, we need to revert this commit but only from branch-6.1.
In master the tests are from version without jmx to version without jmx.
But in 6.1 we do it from 6.0 that is with jmx (AFAIU) to 6.1 that is without jmx.

Am I right here?

What about 2024.2?

I think we just need to fix this change, s/autobackup/backup/, and test it to be working

Does 2024.2 have jmx in it or not?

And do we have any documentation or change in procedure for upgrades with jmx and without?

Now @yaronkaikov told me that 6.0 also doesn't have jmx installed.

ah but we didn't have the PR from @aleksbykov to remove it from SCT.

Yes, cause the issue is when upgrading from a version that doesn't have jmx

@fruch / @syuu1228 IIUC, we need to revert this commit but only from branch-6.1. In master the tests are from version without jmx to version without jmx. But in 6.1 we do it from 6.0 that is with jmx (AFAIU) to 6.1 that is without jmx.

If we don't revert this commit on master, we should apply a fix like this:

diff --git a/upgrade_test.py b/upgrade_test.py
index 7b8835318..4c9de0425 100644
--- a/upgrade_test.py
+++ b/upgrade_test.py
@@ -117,7 +117,7 @@ def recover_conf(node):
             node.remoter.run(
                 r'for conf in $( rpm -qc $(rpm -qa | grep scylla) | grep -v contains ) '
                 r'/etc/systemd/system/{var-lib-scylla,var-lib-systemd-coredump}.mount; '
-                r'do if test -e $conf.backup; then sudo cp -v $conf.backup $conf; fi; done')
+                r'do if test -e $conf.autobackup; then sudo cp -v $conf.autobackup $conf; fi; done')
         else:
             node.remoter.run(
                 r'for conf in $(cat /var/lib/dpkg/info/scylla-*server.conffiles '

Otherwise the upgrade test will keep failing.
backup_conf() backs up io.conf to io.conf.autobackup, but recover_conf() tries to restore the file from io.conf.backup, which does not exist, so it fails.
This is not about the purpose of the patch (restore only existing files, since we are dropping the Java components); it's about a backup filename mismatch (maybe a mistake in the patch?).
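One way to picture the mismatch and avoid it is to have both directions share a single suffix. The sketch below is a local-filesystem illustration only: the real backup_conf()/recover_conf() in upgrade_test.py run shell loops on the node through node.remoter.run, and the `BACKUP_SUFFIX` constant and the path-list signatures here are assumptions.

```python
import shutil
from pathlib import Path
from typing import Iterable, List

# One suffix shared by backup and restore, so the two cannot drift apart.
BACKUP_SUFFIX = ".autobackup"


def backup_conf(paths: Iterable[Path]) -> None:
    """Copy each existing config file to <name>.autobackup."""
    for conf in paths:
        if conf.exists():
            shutil.copy2(conf, conf.with_name(conf.name + BACKUP_SUFFIX))


def recover_conf(paths: Iterable[Path]) -> List[Path]:
    """Restore only files whose backup exists; return the restored paths."""
    restored = []
    for conf in paths:
        backup = conf.with_name(conf.name + BACKUP_SUFFIX)
        if backup.exists():  # mirrors the shell `test -e` guard
            shutil.copy2(backup, conf)
            restored.append(conf)
    return restored
```

Files that never had a backup are skipped instead of failing the restore, which is the "restore only existent files" behaviour, while the shared constant covers the filename mismatch.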

Yes, it was a mistake in that patch.

Since centos8 was deprecated quite a while before that change was merged, it wasn't noticed until we had centos9.

You have the diff needed; please open a PR, and then we can backport it to where it's needed.

Okay I will

aleksbykov added backport/none Backport is not required Ready for review labels Jun 2, 2024

aleksbykov requested review from fruch and soyacz June 2, 2024 05:30

github-actions bot assigned aleksbykov Jun 2, 2024

fruch requested changes Jun 2, 2024

aleksbykov force-pushed the fix-rolling-upgrade branch from d911caf to 3f10ef4 June 6, 2024 16:19

aleksbykov marked this pull request as draft June 7, 2024 08:02

aleksbykov force-pushed the fix-rolling-upgrade branch 4 times, most recently from 06c1628 to b07f1b9 June 11, 2024 13:56

aleksbykov changed the title ~~fix(restore-backup): Restore only existent files~~ fix(rolling-upgrade): Set of commits to fix rolling upgrade from 6.0-6.1 Jun 11, 2024

aleksbykov requested a review from fruch June 11, 2024 14:05

aleksbykov marked this pull request as ready for review June 11, 2024 14:05

aleksbykov force-pushed the fix-rolling-upgrade branch from b07f1b9 to fb2ada6 June 14, 2024 10:06

fruch requested a review from yarongilor June 17, 2024 13:50

fruch mentioned this pull request Jun 17, 2024

all rolling upgrade are failing cause it's using CDC, and it doesn't work with tablets #7602

Open

2 tasks

yarongilor reviewed Jun 18, 2024

test-cases/upgrades/rolling-upgrade.yaml

aleksbykov force-pushed the fix-rolling-upgrade branch from fb2ada6 to a2ea3ee June 19, 2024 11:51

aleksbykov requested a review from yarongilor June 19, 2024 11:52

yarongilor reviewed Jun 19, 2024

sdcm/fill_db_data.py

yarongilor previously approved these changes Jun 19, 2024

fruch reviewed Jun 20, 2024

upgrade_test.py

fruch reviewed Jun 20, 2024

aleksbykov added the backport/6.0 label Jun 25, 2024