
[WIP] Sanity checking around backing up files #395

Open · wants to merge 5 commits into master

Conversation

@kiddom-kq (Contributor) commented Aug 24, 2021

While trying to deploy Medusa, I ran into a few issues. The root cause of #390 was an issue with file system permissions and the silent-failure behavior of Python's glob().
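The silent-failure behavior mentioned above is easy to reproduce: `glob()` does not raise for a path that does not exist (or that the process cannot read) — it simply returns an empty list, which is easy to mistake for "nothing to back up". A minimal illustration, using a deliberately nonexistent path:

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # A snapshot-like path that is guaranteed not to exist.
    missing = os.path.join(tmp, "no-such-snapshot-dir", "*")
    # glob() raises no exception here -- it just returns an empty
    # list, the silent failure mode described in this PR.
    print(glob.glob(missing))  # -> []
```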

This PR implements some of the debug logging that I wish I had had while troubleshooting and a basic sanity check to abort execution as soon as an error is observed rather than waiting for a (misleading) failure at a later point in execution.

┆Issue is synchronized with this Jira Task by Unito
┆friendlyId: K8SSAND-1398
┆priority: Medium

During node backup, the Python process running as the `medusa` user was unable to locate snapshot files on disk. If no plausible directories with snapshot files exist on the host, consider this a failure condition and abort. Do not let execution proceed erroneously, leading to a misleading "backup successful" message.
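A sanity check like the one this PR describes could look roughly as follows. This is only a sketch: the function name, glob pattern, and error message are hypothetical, not Medusa's actual code.

```python
import glob
import logging

def find_snapshot_dirs(data_path, snapshot_tag):
    """Locate snapshot directories, aborting early if none exist."""
    # Hypothetical Cassandra-style layout: <data>/<keyspace>/<table>/snapshots/<tag>
    pattern = f"{data_path}/*/*/snapshots/{snapshot_tag}"
    dirs = glob.glob(pattern)
    logging.debug("Globbed %s, found %d snapshot dir(s)", pattern, len(dirs))
    if not dirs:
        # Fail fast instead of later reporting a misleading
        # "backup successful" message.
        raise RuntimeError(
            f"No snapshot directories found under {data_path}; "
            "check file system permissions for the medusa user."
        )
    return dirs
```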
@kiddom-kq (Contributor, Author)

Would ValueError be more appropriate than the base Exception that's being raised? I didn't see many custom exception types in the Medusa code base, and of the few, none really fit this case. There are several places in the code base where raise Exception(error_msg) is used, so I'm not sure if there's a lint-disable comment I can leave there to prevent SonarCloud from objecting.

@sonarcloud (bot) commented Aug 30, 2021

Kudos, SonarCloud Quality Gate passed!

- Bugs: 0 (rating A)
- Vulnerabilities: 0 (rating A)
- Security Hotspots: 0 (rating A)
- Code Smells: 0 (rating A)
- Coverage: no coverage information
- Duplication: 0.0%

@adejanovski (Contributor)

@kiddom-kq, it looks like this broke the integration tests. You can run them locally using ./run_integration_tests.sh; the prerequisite is having ccm installed (pip install ccm).
Let me know if you need some assistance to debug this.

@adejanovski (Contributor)

Would ValueError be more appropriate than the base Exception that's being raised? I didn't see many custom exception types in the Medusa code base, and of the few, none really fit this case. There are several places in the code base where raise Exception(error_msg) is used, so I'm not sure if there's a lint-disable comment I can leave there to prevent SonarCloud from objecting.

Any improvement toward a more specific use of exceptions is welcome. Feel free to make any change necessary.
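One way to act on this suggestion is a small domain-specific exception class, so callers can catch the failure precisely and generic-exception linters stop objecting. A sketch — the class and function names here are hypothetical, not part of Medusa:

```python
class NoSnapshotDirectoriesError(RuntimeError):
    """Raised when a backup finds no snapshot directories to upload."""


def check_snapshot_dirs(dirs):
    # Replacing a bare `raise Exception(...)` with a named subclass
    # is the kind of change linters like SonarCloud no longer flag,
    # and callers can catch this specific failure mode.
    if not dirs:
        raise NoSnapshotDirectoriesError(
            "No snapshot directories found; aborting backup."
        )
    return dirs
```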

@kiddom-kq (Contributor, Author)

@adejanovski

I am having some trouble understanding the test.

 ./run_integration_tests.sh --test=16  --cassandra-version=2.2.19 -vv

That run fails because the assert statement I added is triggered.

What is supposed to happen when find_dirs() has no dirs?

Looking at the test:

    Scenario Outline: Perform a differential backup over gRPC , verify its index, then delete it over gRPC with Jolokia
        Given I have a fresh ccm cluster with jolokia "<client encryption>" running named "scenario16"
        Given I am using "<storage>" as storage provider in ccm cluster "<client encryption>" with gRPC server
        Then the gRPC server is up
        When I create the "test" table in keyspace "medusa"
        When I load 100 rows in the "medusa.test" table
        When I run a "ccm node1 nodetool flush" command
        When I perform a backup over gRPC in "differential" mode of the node named "grpc_backup"
        Then the backup index exists
        Then I verify over gRPC that the backup "grpc_backup" exists
        Then I can see the backup index entry for "grpc_backup"
        Then I can see the latest backup for "127.0.0.1" being called "grpc_backup"
        Then I delete the backup "grpc_backup" over gRPC
        Then I verify over gRPC the backup "grpc_backup" does not exist
        Then I shutdown the gRPC server

It looks like When I perform a backup over gRPC in "differential" mode of the node named "grpc_backup" is the first time that a backup is taken of the scenario16 cluster. If that is correct, how can a table with 100 rows in it have no directories that need to be backed up?

When I disable the assert, the execution continues to call add_backup_finish_to_index(storage, node_backup) and then set_latest_backup_in_index(storage, node_backup)

Those functions put a record of the backup happening in the storage/index... even though nothing was backed up.
The next line of the test: Then the backup index exists would see that something was added to the index and the test continues on happy to delete the (empty) backup that it just made.

Can you confirm that the differential backup should return 0 directories?

@adejanovski (Contributor)

@kiddom-kq, differential backups put the sstables into a data directory that's at the same level as the backup directories. So you should have something like this:
[screenshot: differential backup directory layout]

While full backups use this layout:
[screenshot: full backup directory layout]

Those 100 rows should definitely be backed up in the data directory.
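Since the screenshots are not reproduced here, the difference described above can be sketched as a directory tree. The exact path names are illustrative only, inferred from the comment rather than from Medusa's documentation:

```
# differential: one shared data/ directory beside the per-backup dirs
<prefix>/<node>/data/            <- sstables live here, shared
<prefix>/<node>/backup1/meta/
<prefix>/<node>/backup2/meta/

# full: each backup carries its own data directory
<prefix>/<node>/backup1/data/    <- sstables copied per backup
<prefix>/<node>/backup1/meta/
```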

@adejanovski (Contributor)

Hi @kiddom-kq, is this still something you're working on?

@rzvoncek (Contributor)

ping @kiddom-kq
