Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable tests/integration/standard #219

Merged
merged 16 commits into from
May 31, 2023

Conversation

Lorak-mmk
Copy link

Some integration tests were disabled in CI, and so have rotten a bit and were not passing - most notably, our shard-awareness tests.
This PR fixes / disables failing tests in tests/integration/standard and enables them in CI.

There are also some minor general fixes in tests (strict xfail, pip install -e). See commit messages for descriptions of each change.

ci/run_integration_test.sh Outdated Show resolved Hide resolved
@@ -37,8 +38,8 @@


def setup_module():
os.environ['SCYLLA_EXT_OPTS'] = "--smp 4 --memory 2048M"
use_cluster('shared_aware', [3], start=True)
os.environ['SCYLLA_EXT_OPTS'] = "--smp 2"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand you want to fit the test in GitHub action, but changing the number of shards changes the test logic.

The test is querying specific keys, that should make the query go to specific shard, when smp changes we need to pick new keys, or the target shard.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, and I did change the keys - I took them from Java driver, as you can see in commit message. Java driver has very similar tests, and uses smp=2: https://github.com/scylladb/java-driver/blob/b3f3ebaf161b21e5c4840ec294595d4e4b39d9bf/driver-core/src/test/java/com/datastax/driver/core/ShardAwarenessTest.java

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know I've written them initially

Anyhow we are using exactly the tracing for confirmation,

I would remove the assert before figuring out what changed in Scylla that we can't trust them anymore.

It might hide a bug in the driver code.

Also I was quite sure the java tests had them

@@ -62,7 +63,6 @@ def verify_same_shard_in_tracing(self, results, shard_name):
LOGGER.info("%s %s", event.thread_name, event.description)
for event in events:
self.assertEqual(event.thread_name, shard_name)
self.assertIn('querying locally', "\n".join([event.description for event in events]))
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those assert at critical to this test, they are the confirmation that our query landed on the expected shard.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test_closing_connections was consistently failing for me. I investigated why that is happening, and came to a conclusion that sometimes even when query is sent to a correct node and correct shard, Scylla can decide to delegate it's execution to another query - in that case 'querying locally' will not be present in the tracing log.

This was confirmed to me by @kbr-scylla on Slack.

The test still checks the shard number for each entry in tracing log - see 2 lines above the one referenced by this comment.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's a specific case, I would change it only for that case.

@Lorak-mmk
Copy link
Author

Rebased on master, reverted one changed requested by @fruch. There was one more failing test, and it passes for me locally - if it fails again I'll investigate.

Regarding removal of "querying locally" assert: I propose to merge this without restoring the assert - as it causes the test to fail. I'll open an issue about it, and look into it after coming back from PTO.

@Lorak-mmk
Copy link
Author

Lorak-mmk commented Mar 30, 2023

I don't get why this failed - the branch contains the --smp 1 fix that was merged to master.

@edit 2023-03-30 13:26:42.165 DEBUG [thread:57]: Connected to 2/2 shards on host 127.0.0.2:9042 (0 missing or needs replacement) <- a message from logs. There should be 1 shard, maybe the fix doesn't work for some reason?

@Lorak-mmk
Copy link
Author

Oh, it's probably not creating a new cluster but reusing the existing one

@Lorak-mmk
Copy link
Author

Ok, so test_authentication_misconfiguration.py seems to be flaky, I'll have to fix it...

@Lorak-mmk
Copy link
Author

Rebased on master - to see if my (now reverted) PR wasn't the cause of test failures.

Without `-e`, driver is installed into site-packages, and then installed
version conflicts with source. This has caused me problems many times
during development, because source version doesn't work (native libraries)
are not compiled.

`-e` causes driver to be installed in-place, so there are no more conflicts
and import issues.
Tests marked as xfailing should be failing - if they are not, it means
something in Scylla changed and we need to adapt the tests.
This attribute means that a passing xtest will cause a failure.
Renamed some decorators (those that I previously added) from lowercase
to lowercase with underscores (as advised by PEP 8 for functions - which
a decorator is). I changed newly added decorators, didn't yet touch
older ones - still need to decide wheter to do it.
It has some advantages, explained in the comment them.
For completeness, I'm copying this comment here:
# pytest.mark.xfail instead of unittest.expectedFailure because
# 1. unittest doesn't skip setUpClass when used on class and we need it sometimes
# 2. unittest doesn't have conditional xfail, and I prefer to use pytest than custom decorator
# 3. unittest doesn't have a reason argument, so you don't see the reason in pytest report

In the future all decorators should probably be switched over.
Previous code was incorrect and couldn't possibly work.
Scylla need more time to start which caused the test to fail.
Increasing to 60 seconds was not enough, 120 seems to work.
This test was previously disabled because required functionality was
not implemented in CCM.
This tests currently passes - and I manually inspected the logs to make
sure node3 really has authentication enabled.
Many tests failing on Scylla were labeled with unconditional xfail.
This has some problems:
- It's hard to tell why a test was marked.
- When some functionality is implemented in Scylla, we don't have an
    easy way to reenable tests that use this functionality.
- Test is skipped also when testing with Cassandra

This commit introduces more labels for failing tests. It fixes those problems:
- Label name and reason string explain why test is disabled
- We can edit label definition to enable tests on newer Scylla version
- Tests are only skipped in environment where they are expected to fail
instead of unittest.expectedFailure. The former has a `reason` argument,
and this reason is shown in test report - so it's easier to judge wheter
the test should really fail.
Tests running `sudo` are problematic to run locally during development,
as you can't just run the tests - you need to watch them and wait for
sudo prompt. In this test, it would be better to use some kind of proxy.

As the test was already marked xfail, skip it to make development easier.
Use shard values (keys, shard number) from Java driver tests.
Java driver uses values compatible with smp=2, so with those values
we'll be able to run those tests in CI (it has only 2 cores).
All the tests should be passing now, so they can be enabled.
One of the tests was failing in CI, even after previous fix.
This is probably caused by cluster being reused, instead of
being recreated with newly set env option.
When analyzing output of failed test it is useful to know which node emitted which message.
test_closing_connections is failing, for multiple reasons.
Skip it until proper investigation can be performed.
Copy link

@fruch fruch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@Lorak-mmk
Copy link
Author

I don't want to merge it yet. There are flaky tests which I want to disable (and open issues about them), but the old logs expired so I rerun CI a few times - and now they started to pass of course ;_;

@Lorak-mmk
Copy link
Author

I don't get it - there were 1-3 tests that were failing with high probability, but now there are I think 5 successful runs in a row. Magic? I guess we can merge this, if failures reappear I'll disable the tests and open the issues.

@avelanarius avelanarius merged commit 5afb73d into scylladb:master May 31, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants