[dtest] test_table_alter_delete fails #17549
This is new to me (node 1):
See #15228
https://jenkins.scylladb.com/job/scylla-master/job/gating-dtest-release/8223/artifact/logs-full.release.002/1709104250457_lwt_schema_modification_test.py%3A%3ATestLWTSchemaModification%3A%3Atest_table_alter_delete/node1.log when testing #17554
@nyh - are you going to look at this failure? It seems like a very recent regression (I suspect/hope).
I wasn't planning to, but I'll do it now :-)
Looking at https://jenkins.scylladb.com/view/nexts/job/scylla-master/job/next/7297/ the failing dtest is
The failure is in the execute() line in:

    def run(self, session, stop, start_event, end_event):
        if start_event:
            start_event.wait()  # Wait for other action to signal done
        for i in range(self.row_start, self.row_end):
            # Trigger scylladb/scylladb#6174 race condition with prepare and then execute
            delete_cql = "DELETE FROM table1 WHERE pk = %i AND ck = %i %s" % (i, i, self.lwt)
            delete_stmt = session.prepare(delete_cql)
            delete_stmt.consistency_level = ConsistencyLevel.ALL
            session.execute(delete_stmt)

We get:
As @mykaul noticed, in the log https://jenkins.scylladb.com/job/scylla-master/job/gating-dtest-release-with-consistent-topology-changes/175/artifact/logs-full.release.002/last/node1.log there is:
I'm not at all familiar with what this test is doing or why. It has a comment saying that #6174 was a race condition (?) with prepare and execute. Perhaps a prepared statement had outdated information about the schema and was used to prepare a view update, but that shouldn't have been a problem - I'll take a look at why it is, and whether this is a recent regression like @mykaul suspects. Something which smells relevant is commit 5810396 ("Invalidate prepared statements for views when their schema changes.") for #16392, but I don't yet understand this test or code well enough to say if that's really true or why.
The test has
In the above log lines, we can see that commit 5810396 sometimes worked as expected, we see the invalidation message for the view
What I still don't understand is why, even if a view update fails, the base write failed. Normally the base write doesn't need the view update to succeed. I'll try to see if I can reproduce this issue with a more reliable and simpler test.
Another thing I can do is to check if I can reproduce this dtest failure locally and reliably (every time) and if so, do a bisection. But I'm guessing the failure isn't reliable and this won't work - so that's my second-favorite option.
So I think a more accurate scenario is slightly different from what I wrote above:
See #17563 for another batch of logs, just in case it helps.
I can solve one part of the puzzle, why it started: I've enabled this test in:
as it was disabled for a long time for #6174, which was marked as fixed/solved. I'll revert this commit.
@fruch let's establish some criteria for enabling tests
We did, and it's running the test in CI 100 times, but we don't have to enforce it, so mistakes can still happen. Anyhow, this was reverted 2h ago.
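The "run it 100 times" criterion mentioned above can be sketched as a small harness. This is only an illustration, not the CI job itself: `count_failures` and `make_flaky_test` are hypothetical names, and the flaky test is a deterministic stand-in for a real dtest.

```python
from typing import Callable


def count_failures(test_fn: Callable[[], None], runs: int = 100) -> int:
    """Run test_fn `runs` times and return how many runs raised an exception."""
    failures = 0
    for _ in range(runs):
        try:
            test_fn()
        except Exception:
            failures += 1
    return failures


def make_flaky_test(fail_every: int) -> Callable[[], None]:
    """Hypothetical stand-in for a dtest: fails on every `fail_every`-th call."""
    calls = {"n": 0}

    def test() -> None:
        calls["n"] += 1
        if calls["n"] % fail_every == 0:
            raise AssertionError("simulated intermittent failure")

    return test


if __name__ == "__main__":
    # A test is only a candidate for enabling if no run out of 100 fails.
    failures = count_failures(make_flaky_test(fail_every=5), runs=100)
    print(f"{failures} failures out of 100 runs")
```

A gate built on this would refuse to enable the test unless the failure count is zero; as noted above, nothing forces anyone to actually run it.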
However, a view update is not considered "empty" if it's a row tombstone, and in this case this is what it should be - because the only updates are DELETE. It's a DELETE ... IF EXISTS, actually, but I can't think of any way this changes anything. So where does an "empty" view update come from while doing a DELETE ... IF EXISTS?
Now that @fruch reverted the test enablement, @mykaul please reconsider if you really want this issue at P1. It's no longer a CI blocker but may indicate a real bug - one which we have known about for a long time (#15228), so it will be good to hunt it down, but I'm not sure why it's urgent. Note that the revert broke the test and now it fails on an unrelated problem (https://github.com/scylladb/scylla-dtest/pull/3999#issuecomment-1971709895), so to try to reproduce this bug I need to un-revert the revert. So that's what I'm doing now.
I just ran the unreverted test |
I was finally able to reproduce this bug locally. After undoing the test revert (run
(dtest-dbuild is my script for https://github.com/scylladb/scylla-dtest/issues/4005) The test failed 3 out of 10 times. In each of the three failures, node1's log had:
(the message was always on node 1, but the mutation was sometimes from node 3, sometimes from 2).
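With a failure seen in roughly 3 of 10 runs, it is useful to estimate how many runs a reproduction attempt needs. The sketch below assumes the runs are independent and that 0.3 is a fair estimate of the per-run failure probability (ten runs is a small sample, so this is rough); the function names are made up for illustration.

```python
import math


def prob_at_least_one_failure(p_fail: float, runs: int) -> float:
    """Probability that at least one of `runs` independent runs fails."""
    return 1.0 - (1.0 - p_fail) ** runs


def runs_needed(p_fail: float, confidence: float) -> int:
    """Smallest number of runs that shows a failure with the given confidence."""
    if not (0.0 < p_fail < 1.0 and 0.0 < confidence < 1.0):
        raise ValueError("p_fail and confidence must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_fail))


if __name__ == "__main__":
    n = runs_needed(p_fail=0.3, confidence=0.95)
    print(n, prob_at_least_one_failure(0.3, n))
```

Under these assumptions, about nine runs are enough to see the failure at least once with 95% probability, which matches the "fairly consistent, albeit slow" reproduction experience.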
@mykaul I see you reduced this issue from P1 to P2, but since I was just able to reproduce it locally, I want to spend a bit more time on it. If I don't make any progress on it today, I'll move on to the jillion other things I need to do.
Converting the
It doesn't really help me understand, but now that I can fairly consistently (albeit slowly) reproduce this problem, I can add more printouts.
Hi @denesb I'm zeroing in on this problem. It appears that we're reading mutation fragments outputted by
I can modify the MV code to silently ignore the empty clustering row instead of throwing and breaking the test, but I want to understand what I'm doing and, more importantly, I want to consider whether there's a deeper bug here, perhaps in the mutation reader or something, that can bite us in other places where this empty clustering row can be returned and confuse other code as well (?).
So it is already an empty row in the partition - no content or tombstone, I have no idea how it happened but it's not a problem of the flat mutation reader, which correctly returns an empty row. I have no idea what "cont: true" means :-( I'm continuing to investigate. |
As I suspected, the schema change is involved in this mess. The functions
Then

    schema_ptr base = schema();
    m.upgrade(base);

And after this upgrade, because the new schema had int_col deleted, an empty row is left in m:
This brings up a few questions I'll need to answer:
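The effect of the upgrade described above can be mimicked with a toy model. This is not Scylla's `mutation::upgrade()`: the `Row` class, the `upgrade` function and the column name `other_col` are all hypothetical, chosen only to show how dropping a column's cells can leave a row entry with nothing in it.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class Row:
    """Toy clustering row: regular cells plus optional row marker and tombstone."""
    cells: Dict[str, int] = field(default_factory=dict)
    has_marker: bool = False
    has_tombstone: bool = False

    def is_empty(self) -> bool:
        return not self.cells and not self.has_marker and not self.has_tombstone


def upgrade(rows: Dict[int, Row], current_columns: Set[str]) -> Dict[int, Row]:
    """Drop cells of columns missing from the current schema.

    Like the behaviour described in the thread, the row entry itself is kept
    even when no cell survives.
    """
    return {
        ck: Row(
            cells={c: v for c, v in row.cells.items() if c in current_columns},
            has_marker=row.has_marker,
            has_tombstone=row.has_tombstone,
        )
        for ck, row in rows.items()
    }


if __name__ == "__main__":
    # A mutation that sets only int_col, upgraded to a schema without int_col.
    mutation = {1: Row(cells={"int_col": 1})}
    upgraded = upgrade(mutation, current_columns={"other_col"})
    print(upgraded[1], upgraded[1].is_empty())
```

The row with clustering key 1 is still present after the upgrade, but it has no cells, no marker and no tombstone, which is the "empty row" the view-update code later trips over.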
Beyond questions 2 and 3 above which I need to solve (that empty rows should have been allowed if an upgrade() creates them), I'm also completely puzzled how the original
The failure in the test clearly happens when the test is sending a DELETE request,
The only place we did similar mutations was prepared INSERT statements in
I verified that the initial write of the data in the test looked like this:
where we can clearly see that this is an INSERT (there's a row marker) and also setting of
So the mutation that sets only
I wonder if that weird update with
If that is really the correct explanation, then the whole business of these extra int_col=1 mutations is benign, and there is no need to fix it - we just need MV to stop thinking that an empty row is bad (#15228) and then this issue can be closed.
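The idea that the MV code should stop treating an empty row as an error can be sketched as follows. This is a toy model and not the actual Scylla change: the row representation, `generate_view_updates`, `LogicError` and the `ignore_empty` switch are assumptions made for illustration. It also reflects the earlier remark that a row tombstone does not count as "empty".

```python
from typing import Dict, List, Tuple


class LogicError(Exception):
    """Stand-in for the std::logic_error seen in the node log."""


def is_empty(row: dict) -> bool:
    """A row is empty if it has no cells, no row marker and no tombstone."""
    return not row.get("cells") and not row.get("marker") and not row.get("tombstone")


def generate_view_updates(base_rows: Dict[int, dict],
                          ignore_empty: bool = True) -> List[Tuple[str, int]]:
    """Return one (kind, clustering_key) view update per non-empty base row."""
    updates = []
    for ck, row in base_rows.items():
        if is_empty(row):
            if ignore_empty:
                continue  # proposed behaviour: nothing to propagate to the view
            raise LogicError("Empty materialized view updated")
        kind = "delete" if row.get("tombstone") else "upsert"
        updates.append((kind, ck))
    return updates


if __name__ == "__main__":
    rows = {1: {"cells": {}}, 2: {"tombstone": True}}
    print(generate_view_updates(rows))  # the empty row 1 is skipped
```

With `ignore_empty=False` the same input raises, which in the real system surfaced as a failed write; with the default, the empty row simply produces no view update.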
I looked at the read repair code (which I haven't reviewed in many years), and as far as I can tell
and as far as I can tell, the
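The read-repair hypothesis can be illustrated with a toy reconciliation: replicas return different cells for the same row, the coordinator merges them by timestamp, and each replica is sent only what it is missing. This is a simplified sketch under stated assumptions, not Scylla's read-repair code; `reconcile`, `repair_mutations` and the node names are hypothetical.

```python
from typing import Dict, Tuple

Cell = Tuple[int, int]        # (write timestamp, value)
RowCells = Dict[str, Cell]    # column name -> cell


def reconcile(replies: Dict[str, RowCells]) -> RowCells:
    """Merge replica replies, keeping the newest cell for each column."""
    merged: RowCells = {}
    for cells in replies.values():
        for col, cell in cells.items():
            if col not in merged or cell[0] > merged[col][0]:
                merged[col] = cell
    return merged


def repair_mutations(replies: Dict[str, RowCells]) -> Dict[str, RowCells]:
    """For each replica, the cells it must receive to match the merged row."""
    merged = reconcile(replies)
    return {
        node: {col: cell for col, cell in merged.items() if cells.get(col) != cell}
        for node, cells in replies.items()
    }


if __name__ == "__main__":
    # node1 no longer has int_col; node2 still returns a value for it.
    replies = {"node1": {}, "node2": {"int_col": (10, 1)}}
    print(repair_mutations(replies))
```

In this model node1 receives a mutation that sets only `int_col`, which is the kind of mutation that becomes an empty row once the receiving node upgrades it to a schema without that column.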
This one-line patch fixes a failure in the dtest

    lwt_schema_modification_test.py::TestLWTSchemaModification::test_table_alter_delete

where an update sometimes failed due to an internal server error, and the log had the mysterious warning message:

    "std::logic_error (Empty materialized view updated)"

We've also seen this log-message in the past in another user's log, and never understood what it meant.

It turns out that the error message was generated (and warning printed) while building view updates for a base-table mutation, and noticing that the base mutation contains an *empty* row - a row with no cells or tombstone or anything whatsoever. This case was deemed (8 years ago, in d5a61a8) unexpected and nonsensical, and we threw an exception.

But this case actually *can* happen - here is how it happened in test_table_alter_delete - which is a test involving a strange combination of materialized views, LWT and schema changes:

1. A table has a materialized view, and also a regular column "int_col".
2. A background thread repeatedly drops and re-creates this column int_col.
3. Another thread deletes rows with LWT ("IF EXISTS").
4. These LWT operations each reads the existing row, and because of repeated drop-and-recreate of the "int_col" column, sometimes this read notices that one node has a value for int_col and the other doesn't, and creates a read-repair mutation setting int_col (the difference between the two reads includes just this column).
5. The node missing "int_col" receives this mutation which sets only int_col. It upgrade()s this mutation to its most recent schema, which doesn't have int_col, so it removes this column from the mutation row - and is left with a completely empty mutation row. This completely empty row is not useful, but upgrade() doesn't remove it.
6. The view-update generation code sees this empty base-mutation row and fails it with this std::logic_error.
7. The node which sent the read-repair mutation sees that the read repair failed, so it fails the read and therefore fails the LWT delete operation. It is this LWT operation which failed in the test, and caused the whole test to fail.

The fix is trivial: an empty base-table row mutation should simply be *ignored* when generating view updates - it shouldn't cause any error.

Before this patch, test_table_alter_delete used to fail in roughly 20% of the runs on my laptop. After this patch, I ran it 100 times without a single failure.

Fixes scylladb#15228
Fixes scylladb#17549

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
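To put the "100 runs without a single failure" figure in perspective, here is a quick calculation of how likely that outcome would be if the bug were still present. It assumes independent runs and takes the roughly 20% failure rate quoted above at face value; `prob_all_pass` is a name made up for this sketch.

```python
def prob_all_pass(p_fail: float, runs: int) -> float:
    """Probability that `runs` independent runs all pass when each fails with p_fail."""
    return (1.0 - p_fail) ** runs


if __name__ == "__main__":
    # Chance of 100 clean runs if the 20% failure rate were unchanged.
    print(prob_all_pass(0.2, 100))
```

Under these assumptions the chance of 100 clean runs with an unfixed bug is on the order of 1e-10, so the clean streak is strong evidence that the patch removed this particular failure mode.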
I don't understand how this test still fails because @fruch reverted the introduction of that test (in dtest commit 3815457f2bd2c40f1205ed9af8be91f7a589e373).
those links are 8 days old...
yeah, these failure reports were created before the test was removed from gating. i was just annotating the failures in my PRs to ensure that all of them are tracked.
The same commit message appears again in a second referenced commit, with the added trailer: Closes scylladb#17607
https://jenkins.scylladb.com/view/nexts/job/scylla-master/job/next/7297/
https://jenkins.scylladb.com/view/nexts/job/scylla-master/job/next/7296/