-
Notifications
You must be signed in to change notification settings - Fork 95
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(rolling-upgrade): Set of commits to fix rolling upgrade from 6.0-6.1 #7525
Conversation
Not sure, do we need backport to 6.0 |
Take a look at the 6.0 upgrade tests, and you'll know if it's needed or not. It couldn't hurt does it ? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the job you mentioned with the fix, didn't reached that part at all
d911caf
to
3f10ef4
Compare
@aleksbykov please mark it as a draft - there are many commented out parts which look like testing only. |
06c1628
to
b07f1b9
Compare
b07f1b9
to
fb2ada6
Compare
@aleksbykov it would be helpful if you would communicate a bit of the situation, just marking people as reviewer, doesn't mean it would get attention. I wasn't even aware this PR is fixing issues that I was assuming tablets team should be fixing(and they didn't as well) |
fb2ada6
to
a2ea3ee
Compare
I agree my fault, that i didn't noticed. i just wait the fix of scylladb/scylladb#18995 which should be fast, to remove the not need commit, but looks like it is delayed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
few suggestions/comments
mainly one should consider the mix of disabling raft topology and tablets, and that those test would still be operational
@fruch , can you take a look?
I decided not add new sct parameter which could control behavior of enabling/disabling cdc with tablets, and left version_cdc_support method without changes. |
sdcm/fill_db_data.py
Outdated
@@ -3179,6 +3185,12 @@ def version_cdc_support(self): | |||
version_with_support = self.CDC_SUPPORT_MIN_VERSION | |||
return self.parsed_scylla_version >= version_with_support | |||
|
|||
@optional_cached_property |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why optional_cached_property
? it can return None ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i set it as optional for future purposes, but you right. set it as cached_property
upgrade_test.py
Outdated
if not output: | ||
for node in self.db_cluster.nodes: | ||
with self.db_cluster.cql_connection_patient_exclusive(node) as session: | ||
output.extend(get_node_enabled_sstable_version(session)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand this code ?
why not replacing this one with the previous ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I renamed the variable to be more meaningful. and also i don't remove old code, which use logs, because master branch is also used with different version, where old functionality is used.
So if sstable_format was not parsed from logs (as in old versions) then it will be get from scylla table for new versions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reading it from logs doesn't works for new versions ?
reading it from system tables doesn't work for old versions ?
I'm a bit a lot, what exactly happened that we need to touch any of those parts
if we do, I would want to know why, and make sure we don't keep duplication for no real reason.
upgrade_test.py
Outdated
gemini_thread = self.run_gemini(self.params.get("gemini_cmd")) | ||
# TODO: workaround for issue #16317 | ||
if self.version_cdc_support() and not self.tablets_enabled: | ||
gemini_cmd += " --table-options \"cdc={'enabled': true}\"" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand what's the point of running gemini without cdc ?
if the whole purpose it was introduce was for testing cdc...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not the whole point of running Gemini.
Let's not remove it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a bit change the code but save the main idea, to run gemini during rolling upgrade with and without cdc support
72b6428
to
7320b9f
Compare
Add new properties to fill_db_data class, which check is it possible to run cdc with tablets. Now we have issue scylladb/scyladb#16317 that cdc could n't be enabled with tablets if tablets are enabled and issue is opened, then cdc options will not be added to tables and gemini won't be run if tablets are disabled or issue will be fixed, then cdc options will be added to tables and gemini command with cdc will be run
Wait after upgrade/rollback that all nodes are up and normal this is required by raft, that all nodes are up and normal before any topology operations
Get enable sstable format from scylladb system table, because log file doesn't contain appropriate message starting from 6.0
7320b9f
to
3f1a53c
Compare
new :
|
@fruch can you take a look? |
return max(set(output)) | ||
enabled_sstable_format_features.extend(get_node_supported_sstable_versions(node.system_log)) | ||
|
||
if not enabled_sstable_format_features: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still don't understand this fallback.. and why it's here...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@aleksbykov I would expect a follow up on sstable upgrade part, to clean it clarify it up. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what about 2024.2 ?
I think we just need to fix this change, s/autobackup/backup/
, and test it to be working
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does 2024.2 have jmx in it or not?
And do we have any documentation or change in procedure for upgrades with jmx and without?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now @yaronkaikov told me that 6.0 also doesn't have jmx installed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah but we didn't have the PR from @aleksbykov to remove it from SCT.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, cause the issue is when upgrading from a version that doesn't have jmx
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fruch / @syuu1228 IIUC, we need to revert this commit but only from branch-6.1. In master the tests are from version without jmx to version without jmx. But in 6.1 we do it from 6.0 that is with jmx (AFAIU) to 6.1 that is without jmx.
If we won't revert this commit on master, we should apply fix like this:
diff --git a/upgrade_test.py b/upgrade_test.py
index 7b8835318..4c9de0425 100644
--- a/upgrade_test.py
+++ b/upgrade_test.py
@@ -117,7 +117,7 @@ def recover_conf(node):
node.remoter.run(
r'for conf in $( rpm -qc $(rpm -qa | grep scylla) | grep -v contains ) '
r'/etc/systemd/system/{var-lib-scylla,var-lib-systemd-coredump}.mount; '
- r'do if test -e $conf.backup; then sudo cp -v $conf.backup $conf; fi; done')
+ r'do if test -e $conf.autobackup; then sudo cp -v $conf.autobackup $conf; fi; done')
else:
node.remoter.run(
r'for conf in $(cat /var/lib/dpkg/info/scylla-*server.conffiles '
Otherwise upgrade test will keep failing.
Since backup_conf() is backuping io.conf to io.conf.autobackup, but recover_conf() is trying to restore the file from io.conf.backup which is not available, and fails.
It is not about the patch purpose (restore only existant files since we dropping java things), it's about backup filename mismatch (maybe mistake on the patch?).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes it was a mistake in that patch
Since centos8 was deprecated quite before that change was merged, it wasn't noticed until we had centos9
You have the diff needed please open a PR, and then we could backport it to where it's needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes it was a mistake in that patch
Since centos8 was deprecated quite before that change was merged, it wasn't noticed until we had centos9
You have the diff needed please open a PR, and then we could backport it to where it's needed.
Okay I will
Set of commits which are fix several issues in rolling upgrades from 6.0 to 6.1
fix(restore-backup): Restore only existent files - Fixing script error when try to restore backup config files which could be missing
fix(fill_db-UP): Temprorary disable cdc for rolling upgrade - Disable creating table with cdc options because it is not supported with tablets
fix(upgrade_test): Wait while all nodes up after rollback - need to wait after node upgrade/rollback that all nodes are up and normal. it is required by raft topology feature
fix(get_highest_sstable_version): get enabled sstable version from db - Log message with enabled format sstable is dissappeared from log file. Added new methods to get supported sstable format from system_local
fix(upgrade): no assert on upggrade sstable - change assert to log error for current sstable format.
Latest to commits could be scylladb/scylladb#18995
Testing
PR pre-checks (self review)
backport
labelsReminders
sdcm/sct_config.py
)unit-test/
folder)