
Remove broken memory connector test #18202

Merged
merged 1 commit into trinodb:master on Jul 10, 2023

Conversation

martint
Member

@martint martint commented Jul 10, 2023

The test attempts to ensure that queries fail if one of the nodes holding a partition of the data goes offline. However, it's fundamentally broken: it does nothing to ensure that the node being shut down actually contains any data for the table in question.

With the recent changes in #18005, which enable writer scaling for any connector, it is very likely that the table will land on only a subset of the nodes, causing the test to fail whenever the "wrong" node is shut down.
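For illustration, here is a minimal sketch of the removed test's shape (helper names are hypothetical, not the actual test code). Note that nothing establishes that the worker being shut down holds any splits of the table:

```java
import static org.assertj.core.api.Assertions.assertThatThrownBy;

void testTableAccessAfterWorkerShutdown()
        throws Exception
{
    // hypothetical factory: a coordinator plus three workers running the memory connector
    DistributedQueryRunner runner = createMemoryQueryRunner(3);
    runner.execute("CREATE TABLE test_table AS SELECT * FROM tpch.tiny.nation");

    // shut down an arbitrary worker; with writer scaling it may hold no data for test_table
    runner.getServers().get(1).close();

    // passes only if the closed worker actually owned a split of the table
    assertThatThrownBy(() -> runner.execute("SELECT * FROM test_table"))
            .hasMessageContaining("No nodes available to run query");
}
```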

Release notes

(x) This is not user-visible or docs only and no release notes are required.

@findepi
Member

findepi commented Jul 10, 2023

This is a connector test, but the assertion on the "No nodes available to run query" message isn't referring to connector-specific behavior.
I would expect this test to ensure that the memory connector doesn't return a partial state of a table after a worker goes down. This is an important aspect to test. (Testing this may require disabling writer scaling or doing multiple inserts to create a situation where multiple workers hold table data; see the sketch below.)
Do you happen to know where this is tested?
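A hedged sketch of the test shape described above (all helper names hypothetical): first establish that more than one worker actually holds data for the table, then shut down one of those workers and assert that the query fails rather than returning partial rows.

```java
import java.util.Set;

import static org.assertj.core.api.Assertions.assertThat;
import static org.assertj.core.api.Assertions.assertThatThrownBy;

// spread data over several inserts so that, with high probability,
// more than one worker ends up holding part of the table
for (int i = 0; i < 100; i++) {
    runner.execute("INSERT INTO test_table VALUES (" + i + ")");
}

// precondition: verify the data really spans multiple workers
// (nodesHoldingDataFor is a hypothetical helper)
Set<String> nodes = nodesHoldingDataFor(runner, "test_table");
assertThat(nodes).hasSizeGreaterThan(1);

// shut down a node known to hold data (shutDownWorker is hypothetical),
// then check that the query fails instead of returning a partial table
shutDownWorker(runner, nodes.iterator().next());
assertThatThrownBy(() -> runner.execute("SELECT count(*) FROM test_table"))
        .hasMessageContaining("No nodes available to run query");
```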

@martint
Member Author

martint commented Jul 10, 2023

The memory connector declares its splits as not being accessible remotely, so the node selector fails when the node that owns the split is offline.
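For context, this is roughly what that looks like at the SPI level (a simplified sketch, not the exact memory connector source): the split advertises only the node holding the data and declares itself not remotely accessible, so the node selector has no eligible node once that host is offline.

```java
import java.util.List;

import com.google.common.collect.ImmutableList;
import io.trino.spi.HostAddress;
import io.trino.spi.connector.ConnectorSplit;

// simplified memory-connector-style split: it can only run
// on the node that holds the data in memory
public class InMemorySplit
        implements ConnectorSplit
{
    private final HostAddress address; // the one node holding this partition

    public InMemorySplit(HostAddress address)
    {
        this.address = address;
    }

    @Override
    public boolean isRemotelyAccessible()
    {
        return false; // pins the split to getAddresses(); if that node is offline, scheduling fails
    }

    @Override
    public List<HostAddress> getAddresses()
    {
        return ImmutableList.of(address);
    }
}
```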

I agree that it could be tested properly, but this test does not do that. It does not have a mechanism to ensure the node that’s being shut down contains any data for that table.

Doing multiple inserts or disabling writer scaling is not a guarantee, so the test would be inherently flaky. In fact, a test should not make any blind assumptions about where the data will land after an insert.

@martint martint mentioned this pull request Jul 10, 2023
@findepi
Member

findepi commented Jul 10, 2023

Doing multiple inserts or disabling writer scaling is not a guarantee, so the test would be inherently flaky. In fact, a test should not make any blind assumptions about where the data will land after an insert.

I understand there is probability in play and agree it's rather unfortunate.
I don't care about the memory connector that much, but there are other connectors and other tests that face the same challenge. For example, to verify writer scaling itself we use data volumes that trigger scaling. To verify Iceberg or Delta OPTIMIZE we need tables with multiple files. To verify Iceberg and Delta write repartitioning we need data that, without repartitioning, would create too many files that are too small. And so on. I agree to remove this memory connector test, but I don't agree with a general approach of removing imperfect tests, because the same thinking would make us remove tests that are, IMO, too important to be removed.

@martint
Member Author

martint commented Jul 10, 2023

I agree to remove this memory connector test, but I don't agree with a general approach of removing imperfect tests

A test has to:

  • Work every time (no false negatives). If it fails every now and then for unexpected reasons, people will tend to ignore failures.
  • Test what it's supposed to test (no false positives). If it succeeds because it doesn't test what it's supposed to test, it provides a false sense of security.
  • Not make assumptions about implementation details (unless it's specifically designed to test those implementation details). If it makes assumptions about implementation details, e.g., as in the case of the test this PR is removing, it makes it harder to evolve the code.

@martint martint merged commit c2427c7 into trinodb:master Jul 10, 2023
47 checks passed
@martint martint deleted the memory-connector-test branch July 10, 2023 18:20
@github-actions github-actions bot added this to the 422 milestone Jul 10, 2023
@findepi
Member

findepi commented Jul 12, 2023

Not make assumptions about implementation details

That's ideal, but I would rather have a test that makes some assumptions than no test at all.

  • Work every time (no false negatives). If it fails every now and then for unexpected reasons, people will tend to ignore failures.

I do agree with that.

I guess you're actually aware of the amount of work I personally put into coping with flaky tests.
It seems I didn't realize a more robust solution exists. Thanks!

@martint
Member Author

martint commented Jul 12, 2023

That's ideal, but I would rather have a test that makes some assumptions than no test at all.

Agreed, but then when those assumptions become invalid the test needs to go.
