Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(rolling-upgrade): Set of commits to fix rolling upgrade from 6.0-6.1 #7525

Merged
merged 4 commits into from
Jun 30, 2024

Conversation

aleksbykov
Copy link
Contributor

@aleksbykov aleksbykov commented Jun 2, 2024

Set of commits which are fix several issues in rolling upgrades from 6.0 to 6.1

fix(restore-backup): Restore only existent files - Fixing script error when try to restore backup config files which could be missing

fix(fill_db-UP): Temprorary disable cdc for rolling upgrade - Disable creating table with cdc options because it is not supported with tablets

fix(upgrade_test): Wait while all nodes up after rollback - need to wait after node upgrade/rollback that all nodes are up and normal. it is required by raft topology feature

fix(get_highest_sstable_version): get enabled sstable version from db - Log message with enabled format sstable is dissappeared from log file. Added new methods to get supported sstable format from system_local

fix(upgrade): no assert on upggrade sstable - change assert to log error for current sstable format.

Latest to commits could be scylladb/scylladb#18995

Testing

PR pre-checks (self review)

  • I added the relevant backport labels
  • I didn't leave commented-out/debugging code

Reminders

  • Add New configuration option and document them (in sdcm/sct_config.py)
  • Add unit tests to cover my changes (under unit-test/ folder)
  • Update the Readme/doc folder relevant to this change (if needed)

@aleksbykov
Copy link
Contributor Author

Not sure, do we need backport to 6.0

@fruch
Copy link
Contributor

fruch commented Jun 2, 2024

Not sure, do we need backport to 6.0

Take a look at the 6.0 upgrade tests, and you'll know if it's needed or not.

It couldn't hurt does it ?

Copy link
Contributor

@fruch fruch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aleksbykov

the job you mentioned with the fix, didn't reached that part at all

@soyacz
Copy link
Contributor

soyacz commented Jun 7, 2024

@aleksbykov please mark it as a draft - there are many commented out parts which look like testing only.

@aleksbykov aleksbykov marked this pull request as draft June 7, 2024 08:02
@aleksbykov aleksbykov force-pushed the fix-rolling-upgrade branch 4 times, most recently from 06c1628 to b07f1b9 Compare June 11, 2024 13:56
@aleksbykov aleksbykov changed the title fix(restore-backup): Restore only existent files fix(rolling-upgrade): Set of commits to fix rolling upgrade from 6.0-6.1 Jun 11, 2024
@aleksbykov aleksbykov requested a review from fruch June 11, 2024 14:05
@aleksbykov aleksbykov marked this pull request as ready for review June 11, 2024 14:05
@fruch fruch requested a review from yarongilor June 17, 2024 13:50
@fruch
Copy link
Contributor

fruch commented Jun 18, 2024

@aleksbykov it would be helpful if you would communicate a bit of the situation, just marking people as reviewer, doesn't mean it would get attention.

I wasn't even aware this PR is fixing issues that I was assuming tablets team should be fixing(and they didn't as well)

@aleksbykov
Copy link
Contributor Author

@aleksbykov it would be helpful if you would communicate a bit of the situation, just marking people as reviewer, doesn't mean it would get attention.

I wasn't even aware this PR is fixing issues that I was assuming tablets team should be fixing(and they didn't as well)

I agree my fault, that i didn't noticed. i just wait the fix of scylladb/scylladb#18995 which should be fast, to remove the not need commit, but looks like it is delayed.
The set of commits fixes regular rolling upgrades with disabling the cdc.
Could @scylladb/qa-maintainers review the pr?

sdcm/fill_db_data.py Outdated Show resolved Hide resolved
yarongilor
yarongilor previously approved these changes Jun 19, 2024
Copy link
Contributor

@yarongilor yarongilor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

upgrade_test.py Show resolved Hide resolved
Copy link
Contributor

@fruch fruch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

few suggestions/comments

mainly one should consider the mix of disabling raft topology and tablets, and that those test would still be operational

@aleksbykov
Copy link
Contributor Author

@fruch , can you take a look?
I added new property to check status of tablets. and change a bit gemini part in rolling_upgrade tests. Now if tablets is enabled gemini will be started without cdc and if tablets will be disabled, cdc options will be added to gemini command.
jobs are passed:

I decided not add new sct parameter which could control behavior of enabling/disabling cdc with tablets, and left version_cdc_support method without changes.
i think in next PR this method could be safely removed at all, because we don't have version without cdc support

upgrade_test.py Outdated Show resolved Hide resolved
upgrade_test.py Outdated Show resolved Hide resolved
@@ -3179,6 +3185,12 @@ def version_cdc_support(self):
version_with_support = self.CDC_SUPPORT_MIN_VERSION
return self.parsed_scylla_version >= version_with_support

@optional_cached_property
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why optional_cached_property ? it can return None ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i set it as optional for future purposes, but you right. set it as cached_property

upgrade_test.py Outdated
if not output:
for node in self.db_cluster.nodes:
with self.db_cluster.cql_connection_patient_exclusive(node) as session:
output.extend(get_node_enabled_sstable_version(session))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this code ?

why not replacing this one with the previous ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I renamed the variable to be more meaningful. and also i don't remove old code, which use logs, because master branch is also used with different version, where old functionality is used.
So if sstable_format was not parsed from logs (as in old versions) then it will be get from scylla table for new versions

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reading it from logs doesn't works for new versions ?
reading it from system tables doesn't work for old versions ?

I'm a bit a lot, what exactly happened that we need to touch any of those parts

if we do, I would want to know why, and make sure we don't keep duplication for no real reason.

upgrade_test.py Show resolved Hide resolved
upgrade_test.py Outdated Show resolved Hide resolved
upgrade_test.py Outdated
gemini_thread = self.run_gemini(self.params.get("gemini_cmd"))
# TODO: workaround for issue #16317
if self.version_cdc_support() and not self.tablets_enabled:
gemini_cmd += " --table-options \"cdc={'enabled': true}\""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand what's the point of running gemini without cdc ?

if the whole purpose it was introduce was for testing cdc...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not the whole point of running Gemini.
Let's not remove it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a bit change the code but save the main idea, to run gemini during rolling upgrade with and without cdc support

Add new properties to fill_db_data class, which check
is it possible to run cdc with tablets. Now we have issue
scylladb/scyladb#16317 that cdc could n't be enabled with tablets

if tablets are enabled and issue is opened, then
cdc options will not be added to tables and gemini won't be run
if tablets are disabled or issue will be fixed, then
cdc options will be added to tables and gemini command with cdc
will be run
Wait after upgrade/rollback that all nodes are up and normal
this is required by raft, that all nodes are up and normal
before any topology operations
Get enable sstable format from scylladb system table, because
log file doesn't contain appropriate message starting from 6.0
@aleksbykov
Copy link
Contributor Author

new :

  • removed not related commit after issue was fixed: Scylla is not using ME sstables by default scylladb#18995
  • removed old method and vars for version_cdc_support
  • set properities as cached_property
  • add new property to control enabling cdc options for tables based on feature tablets state and SkipPerIssue
  • rename var 'output' -> ''enabled_sstable_format_features' in method 'get_highest_supported_sstable_version' and leave support 2 behaviors get formats from logs and if not found from scylla table

@aleksbykov
Copy link
Contributor Author

@fruch can you take a look?

return max(set(output))
enabled_sstable_format_features.extend(get_node_supported_sstable_versions(node.system_log))

if not enabled_sstable_format_features:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still don't understand this fallback.. and why it's here...

Copy link
Contributor

@fruch fruch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@fruch
Copy link
Contributor

fruch commented Jun 30, 2024

@aleksbykov I would expect a follow up on sstable upgrade part, to clean it clarify it up.

@fruch fruch merged commit 053a724 into scylladb:master Jun 30, 2024
7 checks passed
@fruch fruch added the backport/2024.2 Need backport to 2024.2 label Jul 15, 2024
@scylladbbot scylladbbot added backport/2024.2-done Commit backported to 2024.2 and removed backport/2024.2 Need backport to 2024.2 labels Jul 17, 2024
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fruch / @syuu1228 IIUC, we need to revert this commit but only from branch-6.1.
In master the tests are from version without jmx to version without jmx.
But in 6.1 we do it from 6.0 that is with jmx (AFAIU) to 6.1 that is without jmx.

Am i right here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about 2024.2 ?

I think we just need to fix this change, s/autobackup/backup/, and test it to be working

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does 2024.2 have jmx in it or not?

And do we have any documentation or change in procedure for upgrades with jmx and without?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now @yaronkaikov told me that 6.0 also doesn't have jmx installed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah but we didn't have the PR from @aleksbykov to remove it from SCT.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, cause the issue is when upgrading from a version that doesn't have jmx

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fruch / @syuu1228 IIUC, we need to revert this commit but only from branch-6.1. In master the tests are from version without jmx to version without jmx. But in 6.1 we do it from 6.0 that is with jmx (AFAIU) to 6.1 that is without jmx.

If we won't revert this commit on master, we should apply fix like this:

diff --git a/upgrade_test.py b/upgrade_test.py
index 7b8835318..4c9de0425 100644
--- a/upgrade_test.py
+++ b/upgrade_test.py
@@ -117,7 +117,7 @@ def recover_conf(node):
         node.remoter.run(
             r'for conf in $( rpm -qc $(rpm -qa | grep scylla) | grep -v contains ) '
             r'/etc/systemd/system/{var-lib-scylla,var-lib-systemd-coredump}.mount; '
-            r'do if test -e $conf.backup; then sudo cp -v $conf.backup $conf; fi; done')
+            r'do if test -e $conf.autobackup; then sudo cp -v $conf.autobackup $conf; fi; done')
     else:
         node.remoter.run(
             r'for conf in $(cat /var/lib/dpkg/info/scylla-*server.conffiles '

Otherwise upgrade test will keep failing.
Since backup_conf() is backuping io.conf to io.conf.autobackup, but recover_conf() is trying to restore the file from io.conf.backup which is not available, and fails.
It is not about the patch purpose (restore only existant files since we dropping java things), it's about backup filename mismatch (maybe mistake on the patch?).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it was a mistake in that patch

Since centos8 was deprecated quite before that change was merged, it wasn't noticed until we had centos9

You have the diff needed please open a PR, and then we could backport it to where it's needed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it was a mistake in that patch

Since centos8 was deprecated quite before that change was merged, it wasn't noticed until we had centos9

You have the diff needed please open a PR, and then we could backport it to where it's needed.

Okay I will

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants