[dtest] test_table_alter_delete fails #17549

Closed
avikivity opened this issue Feb 27, 2024 · 33 comments

mykaul commented Feb 28, 2024

18:35:00,091 805     cassandra.policies             INFO     policies.py         :290  | test_table_alter_delete: Using datacenter 'datacenter1' for DCAwareRoundRobinPolicy (via host '127.0.91.1:9042'); if incorrect, please specify a local_dc to the constructor, or limit contact points to local cluster nodes
18:35:00,091 805     cassandra.cluster              INFO     dtest_setup.py      :493  | test_table_alter_delete: New Cassandra host <Host: 127.0.91.4:9042 datacenter1> discovered
18:35:00,092 805     cassandra.cluster              INFO     dtest_setup.py      :493  | test_table_alter_delete: New Cassandra host <Host: 127.0.91.3:9042 datacenter1> discovered
18:35:00,092 805     cassandra.cluster              INFO     dtest_setup.py      :493  | test_table_alter_delete: New Cassandra host <Host: 127.0.91.2:9042 datacenter1> discovered
18:35:10,190 805     errors                         ERROR    conftest.py         :202  | test_table_alter_delete: test failed: 
self = <lwt_schema_modification_test.TestLWTSchemaModification object at 0x7fa9e2235b90>

    def test_table_alter_delete(self):
        """Table alter test"""
>       self._test_combine([DropAddColumn(), DeleteRows(row_start=1, row_end=1000, lwt=True)], loops=3, run_s=10)

lwt_schema_modification_test.py:535: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
lwt_schema_modification_test.py:521: in _test_combine
    thread.result()
/usr/lib64/python3.11/concurrent/futures/_base.py:449: in result
    return self.__get_result()
/usr/lib64/python3.11/concurrent/futures/_base.py:401: in __get_result
    raise self._exception
/usr/lib64/python3.11/concurrent/futures/thread.py:58: in run
    result = self.fn(*self.args, **self.kwargs)
lwt_schema_modification_test.py:470: in _action_thread
    action.run(session, stop, start_event, end_event)  # Hand over control to action
lwt_schema_modification_test.py:293: in run
    session.execute(delete_stmt)
cassandra/cluster.py:2726: in cassandra.cluster.Session.execute
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

>   ???
E   cassandra.WriteFailure: Error from server: code=1500 [Replica(s) failed to execute write] message="Operation failed for lwt_load_ks.table1 - received 1 responses and 1 failures from 2 CL=QUORUM." info={'consistency': 'QUORUM', 'required_responses': 2, 'received_responses': 1, 'failures': 1}

mykaul commented Feb 28, 2024

This is new to me (node 1):

INFO  2024-02-27 18:35:00,121 [shard 0:strm] schema_tables - Schema version changed to e9ec1564-d59e-11ee-371e-531d09647616
WARN  2024-02-27 18:35:00,129 [shard 1:stmt] storage_proxy - Failed to apply mutation from 127.0.91.2#1: std::logic_error (Empty materialized view updated)

nyh commented Feb 28, 2024

This is new to me (node 1):

INFO  2024-02-27 18:35:00,121 [shard 0:strm] schema_tables - Schema version changed to e9ec1564-d59e-11ee-371e-531d09647616
WARN  2024-02-27 18:35:00,129 [shard 1:stmt] storage_proxy - Failed to apply mutation from 127.0.91.2#1: std::logic_error (Empty materialized view updated)

See #15228

@tchaikov

INFO  2024-02-28 07:10:37,367 [shard 0:strm] schema_tables - Schema version changed to 7906c744-d608-11ee-afc8-8c52362410db
WARN  2024-02-28 07:10:37,371 [shard 1:stmt] storage_proxy - Failed to apply mutation from 127.0.36.2#1: std::logic_error (Empty materialized view updated)
INFO  2024-02-28 07:10:38,611 [shard 0:strm] schema_tables - Altering lwt_load_ks.table1 id=770bd290-d608-11ee-bd56-5c2856feba47 version=79c4b101-d608-11ee-913a-d77f2eb49e2c

https://jenkins.scylladb.com/job/scylla-master/job/gating-dtest-release/8223/artifact/logs-full.release.002/1709104250457_lwt_schema_modification_test.py%3A%3ATestLWTSchemaModification%3A%3Atest_table_alter_delete/node1.log when testing #17554

mykaul commented Feb 28, 2024

@nyh - are you going to look at this failure? It seems like a very recent regression (I suspect/hope).

nyh commented Feb 28, 2024

@nyh - are you going to look at this failure? It seems like a very recent regression (I suspect/hope).

I wasn't planning to, but I'll do it now :-)

nyh commented Feb 28, 2024

Looking at https://jenkins.scylladb.com/view/nexts/job/scylla-master/job/next/7297/ the failing dtest is lwt_schema_modification_test.py::TestLWTSchemaModification::test_table_alter_delete

The failure is in the execute() line in:

    def run(self, session, stop, start_event, end_event):
        if start_event:
            start_event.wait()  # Wait for other action to signal done

        for i in range(self.row_start, self.row_end):
            # Trigger scylladb/scylladb#6174 race condition with prepare and then execute
            delete_cql = "DELETE FROM table1 WHERE pk = %i AND ck = %i %s" % (i, i, self.lwt)
            delete_stmt = session.prepare(delete_cql)
            delete_stmt.consistency_level = ConsistencyLevel.ALL
            session.execute(delete_stmt)

We get:

E   cassandra.WriteFailure: Error from server: code=1500 [Replica(s) failed to execute write] message="Operation failed for lwt_load_ks.table1 - received 1 responses and 1 failures from 2 CL=QUORUM." info={'consistency': 'QUORUM', 'required_responses': 2, 'received_responses': 1, 'failures': 1}

As @mykaul noticed, in the log https://jenkins.scylladb.com/job/scylla-master/job/gating-dtest-release-with-consistent-topology-changes/175/artifact/logs-full.release.002/last/node1.log there is:

INFO  2024-02-27 18:34:58,488 [shard 0:strm] view - Finished building view lwt_load_ks.table1_v_idx_index
INFO  2024-02-27 18:34:58,489 [shard 0:strm] migration_manager - Gossiping my schema version e8f242aa-d59e-11ee-f4e9-54db7139c645
INFO  2024-02-27 18:34:58,489 [shard 0:strm] schema_tables - Schema version changed to e8f242aa-d59e-11ee-f4e9-54db7139c645
INFO  2024-02-27 18:35:00,120 [shard 0:strm] schema_tables - Altering lwt_load_ks.table1 id=e7d159b0-d59e-11ee-bc52-7fc197843a46 version=e9ebfc50-d59e-11ee-bc52-7fc197843a46
INFO  2024-02-27 18:35:00,120 [shard 0:strm] query_processor - Column definitions for lwt_load_ks.table1 changed, invalidating related prepared statements
INFO  2024-02-27 18:35:00,121 [shard 1:strm] query_processor - Column definitions for lwt_load_ks.table1 changed, invalidating related prepared statements
INFO  2024-02-27 18:35:00,121 [shard 3:strm] query_processor - Column definitions for lwt_load_ks.table1 changed, invalidating related prepared statements
INFO  2024-02-27 18:35:00,121 [shard 2:strm] query_processor - Column definitions for lwt_load_ks.table1 changed, invalidating related prepared statements
INFO  2024-02-27 18:35:00,121 [shard 0:strm] migration_manager - Gossiping my schema version e9ec1564-d59e-11ee-371e-531d09647616
INFO  2024-02-27 18:35:00,121 [shard 0:strm] schema_tables - Schema version changed to e9ec1564-d59e-11ee-371e-531d09647616
WARN  2024-02-27 18:35:00,129 [shard 1:stmt] storage_proxy - Failed to apply mutation from 127.0.91.2#1: std::logic_error (Empty materialized view updated)
INFO  2024-02-27 18:35:01,571 [shard 0:strm] schema_tables - Altering lwt_load_ks.table1 id=e7d159b0-d59e-11ee-bc52-7fc197843a46 version=eac96401-d59e-11ee-be06-00e6021860dc
INFO  2024-02-27 18:35:01,571 [shard 0:strm] schema_tables - Altering lwt_load_ks.table1_v_idx_index id=e8f23301-d59e-11ee-be06-00e6021860dc version=eac96400-d59e-11ee-be06-00e6021860dc
INFO  2024-02-27 18:35:01,572 [shard 0:strm] query_processor - Column definitions for lwt_load_ks.table1 changed, invalidating related prepared statements
INFO  2024-02-27 18:35:01,572 [shard 0:strm] query_processor - Column definitions for lwt_load_ks.table1_v_idx_index changed, invalidating related prepared statements
INFO  2024-02-27 18:35:01,572 [shard 1:strm] query_processor - Column definitions for lwt_load_ks.table1 changed, invalidating related prepared statements
INFO  2024-02-27 18:35:01,572 [shard 2:strm] query_processor - Column definitions for lwt_load_ks.table1 changed, invalidating related prepared statements
INFO  2024-02-27 18:35:01,572 [shard 3:strm] query_processor - Column definitions for lwt_load_ks.table1 changed, invalidating related prepared statements
INFO  2024-02-27 18:35:01,572 [shard 1:strm] query_processor - Column definitions for lwt_load_ks.table1_v_idx_index changed, invalidating related prepared statements
INFO  2024-02-27 18:35:01,572 [shard 3:strm] query_processor - Column definitions for lwt_load_ks.table1_v_idx_index changed, invalidating related prepared statements
INFO  2024-02-27 18:35:01,572 [shard 2:strm] query_processor - Column definitions for lwt_load_ks.table1_v_idx_index changed, invalidating related prepared statements

I'm not at all familiar with what this test is doing or why. It has a comment saying that #6174 was a race condition (?) with prepare and execute. Perhaps a prepared statement had outdated information about the schema and was used to prepare a view update, but that shouldn't have been a problem - I'll take a look at why it is, and at whether this is a recent regression like @mykaul suspects.

Something which smells relevant is commit 5810396 ("Invalidate prepared statements for views when their schema changes.") for #16392, but I don't have enough understanding of this test or code to say yet whether that's really true, or why.

nyh commented Feb 28, 2024

The test has

  1. A table with a secondary index (the view's name is table1_v_idx_index).
  2. A loop that continuously runs "ALTER TABLE table1 DROP int_col" and then adds the column back with "ALTER TABLE table1 ADD int_col bigint".
  3. A second loop that, concurrently, prepares single-row DELETE operations on the base table and then executes them. This execute is what fails. (A rough sketch of the workload follows below.)
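
A rough, self-contained sketch of that workload, as I understand it (a hedged illustration, not the dtest itself: the real test drives this via its DropAddColumn and DeleteRows actions, and I'm assuming a node reachable at 127.0.0.1 with the lwt_load_ks.table1 schema and its index already created):

    # Sketch of the concurrent ALTER + LWT DELETE workload; illustrative only.
    import threading
    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster

    session = Cluster(['127.0.0.1']).connect('lwt_load_ks')
    stop = threading.Event()

    def alter_loop():
        # Loop 2: repeatedly drop and re-add the regular column int_col.
        while not stop.is_set():
            session.execute("ALTER TABLE table1 DROP int_col")
            session.execute("ALTER TABLE table1 ADD int_col bigint")

    def delete_loop():
        # Loop 3: prepare-then-execute single-row LWT deletes (the
        # prepare/execute race mentioned for scylladb/scylladb#6174).
        for i in range(1, 1000):
            if stop.is_set():
                break
            stmt = session.prepare(
                "DELETE FROM table1 WHERE pk = %d AND ck = %d IF EXISTS" % (i, i))
            stmt.consistency_level = ConsistencyLevel.ALL
            session.execute(stmt)  # this is the call that fails with WriteFailure

    threads = [threading.Thread(target=alter_loop),
               threading.Thread(target=delete_loop)]
    for t in threads:
        t.start()
    threading.Timer(10, stop.set).start()  # stop the mix after ~10s (run_s=10)
    for t in threads:
        t.join()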

In the above log lines, we can see that commit 5810396 sometimes worked as expected: we see the invalidation message for the view lwt_load_ks.table1_v_idx_index. But the error message actually happened before the first such invalidation message. So here is what I suspect happened (with no proof yet):

  1. Node A executed a base-table delete with a schema including the regular column int_col. This prepared a view update matching this schema and sent it to node B.
  2. By the time node B received the view update, the schema had already changed and int_col was deleted. In the view, int_col was the only regular column (table1 has pk,ck as the base key, v in the view key, and int_col - that's it), so perhaps the fact that it disappeared confused the MV code into thinking the view update is empty, which it considered bad, so it declared this an error.

What I still don't understand is why, even if a view update fails, the base write failed. Normally the base write doesn't need the view update to succeed.

I'll try to see if I can reproduce this issue with a more reliable and simpler test.

nyh commented Feb 28, 2024

Another thing I can do is to check if I can reproduce this dtest failure locally and reliably (every time) and if so, do a bisection. But I'm guessing the failure isn't reliable and this won't work - so that's my second-favorite option.

nyh commented Feb 28, 2024

The test has

1. A table with a secondary index (the view's name is `table1_v_idx_index`).

2. A loop that continuously runs "ALTER TABLE table1 DROP int_col" and then adds the column back with "ALTER TABLE table1 ADD int_col bigint".

3. A second loop that, concurrently, prepares single-row DELETE operations on the base table and then executes them. This execute is what fails.

In the above log lines, we can see that commit 5810396 sometimes worked as expected: we see the invalidation message for the view lwt_load_ks.table1_v_idx_index. But the error message actually happened before the first such invalidation message. So here is what I suspect happened (with no proof yet):

What I still don't understand is why, even if a view update fails, the base write failed. Normally the base write doesn't need the view update to succeed.

So I think a more accurate scenario is slightly different from what I wrote above:

  1. Node 2 is the coordinator of a base-table delete with a schema including the regular column int_col. It sent a base-replica write to node 1.
  2. Node 1 received the base-replica update and prepared a view update. It decided there's an update to do (because int_col existed), but then the column int_col was deleted. In the view, int_col was the only regular column (table1 has pk,ck as the base key, v in the view key, and int_col - that's it), so perhaps the fact that it disappeared confused the MV code into thinking the view update is empty, which it considered bad, so it declared this an error.

denesb commented Feb 28, 2024

See #17563 for another batch of logs, just in case it helps.

fruch commented Feb 28, 2024

@nyh

I can solve one part of the puzzle: why it started.

I enabled this test in
https://github.com/scylladb/scylla-dtest/commit/ff2a70b62294d58b3bd40a24e252c37e7dedd18f

as it had been disabled for a long time due to #6174, which was marked as fixed/solved. The test looked to be working well enough, but it seems it's not that stable.

I'll revert this commit.

@avikivity

@fruch let's establish some criteria for enabling tests

fruch commented Feb 28, 2024

@fruch let's establish some criteria for enabling tests

We did, and it's running the test in CI 100 times, but we don't have a way to enforce it, so mistakes can still happen.

Anyhow this was reverted 2h ago.

nyh commented Feb 29, 2024

Node 1 received the base-replica update and prepared a view update. It decided there's an update to do (because int_col existed), but then the column int_col was deleted. In the view, int_col was the only regular column (table1 has pk,ck as the base key, v in the view key, and int_col - that's it), so perhaps the fact that it disappeared confused the MV code into thinking the view update is empty, which it considered bad, so it declared this an error.

However, a view update is not considered "empty" if it's a row tombstone, and in this case that's what it should be - because the only updates are DELETEs. It's actually a DELETE ... IF EXISTS, but I can't think of any way that changes anything. So where does an "empty" view update come from while doing a DELETE .. IF EXISTS?
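
To make that distinction concrete, here is a toy model (illustrative Python, not Scylla's actual clustering_row representation): a row tombstone makes a row non-empty, while "empty" means no cells and no tombstone at all.

    # Toy model of the "empty row" check; illustrative Python only.
    class Row:
        def __init__(self, cells=None, tombstone=None):
            self.cells = cells or {}      # column name -> value
            self.tombstone = tombstone    # row-deletion timestamp, if any

        def empty(self):
            # "Empty" means no cells AND no row tombstone - nothing at all.
            return not self.cells and self.tombstone is None

    deleted = Row(tombstone=2000)  # what a DELETE should produce
    print(deleted.empty())         # False: a row tombstone is not "empty"

    mystery = Row()                # no cells, no tombstone
    print(mystery.empty())         # True: this is what trips the MV code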

dani-tweig added the symptom/ci stability (Issues that failed in ScyllaDB CI - tests and framework) label and removed the area/ci-stability label Feb 29, 2024

nyh commented Feb 29, 2024

Now that @fruch reverted the test enablement, @mykaul please reconsider whether you really want this issue at P1. It's no longer a CI blocker, but it may indicate a real bug - one which we've known about for a long time (#15228) - so it will be good to hunt it down, but I'm not sure why it's urgent.

Note that the revert broke the test, and now it fails on an unrelated problem (https://github.com/scylladb/scylla-dtest/pull/3999#issuecomment-1971709895), so to try to reproduce this bug I need to un-revert the revert. That's what I'm doing now.

nyh commented Feb 29, 2024

I just ran the unreverted test lwt_schema_modification_test.py::TestLWTSchemaModification::test_table_alter_delete 10 times, and it passed every time, so I still can't reproduce this issue :-(
I'll try later to run it even more times (it's slow, each run takes a few minutes).

mykaul added the P2 (High Priority) label and removed the P1 (Urgent) and symptom/ci stability labels Feb 29, 2024

nyh commented Mar 3, 2024

I was finally able to reproduce this bug locally. After undoing the test revert (run git revert 3815457f2bd2c40f1205ed9af8be91f7a589e373) and running

SCYLLA=$HOME/scylla/build/dev/scylla ./dtest-dbuild --count 10 lwt_schema_modification_test.py::TestLWTSchemaModification::test_table_alter_delete

(dtest-dbuild is my script for https://github.com/scylladb/scylla-dtest/issues/4005)

The test failed 3 out of 10 times. In each of the three failures, node1's log had:

logs/1709452683911_lwt_schema_modification_test.py::TestLWTSchemaModification::test_table_alter_delete[9-10]/node1.log:WARN  2024-03-03 09:57:50,884 [shard 1:stmt] storage_proxy - Failed to apply mutation from 127.0.87.2#1: std::logic_error (Empty materialized view updated)

(the message was always on node 1, but the mutation was sometimes from node 3, sometimes from 2).
This message did not appear in any of the successful runs.

nyh commented Mar 3, 2024

@mykaul I see you reduced this issue from P1 to P2, but since I was just able to reproduce it locally, I want to spend a bit more time on it. If I don't make any progress on it today, I'll move on to the jillion other things I need to do.

nyh commented Mar 3, 2024

Converting the log_error to an on_internal_error and reproducing the failure again (this time I was less lucky, it only reproduced once in 20 runs), I see the following backtrace:

utils::on_internal_error(std::basic_string_view<char, std::char_traits<char> >) at on_internal_error.cc:?
db::view::view_update_builder::generate_update(clustering_row&&, std::optional<clustering_row>&&) at view.cc:?
db::view::view_update_builder::on_results() at view.cc:?
db::view::view_update_builder::build_some() at view.cc:?
replica::table::generate_and_propagate_view_updates(seastar::shared_ptr<db::view::view_update_generator>, seastar::lw_shared_ptr<schema const> const&, reader_permit, std::vector<db::view::view_and_base, std::allocator<db::view::view_and_base> >&&, mutation&&, seastar::optimized_optional<flat_mutation_reader_v2>, tracing::trace_state_ptr, std::chrono::time_point<gc_clock, std::chrono::duration<long, std::ratio<1l, 1l> > >) const at table.cc:?
replica::table::do_push_view_replica_updates(seastar::shared_ptr<db::view::view_update_generator>, seastar::lw_shared_ptr<schema const>, mutation, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >, mutation_source, tracing::trace_state_ptr, reader_concurrency_semaphore&, enum_set<super_enum<query::partition_slice::option, (query::partition_slice::option)0, (query::partition_slice::option)1, (query::partition_slice::option)2, (query::partition_slice::option)3, (query::partition_slice::option)4, (query::partition_slice::option)5, (query::partition_slice::option)6, (query::partition_slice::option)7, (query::partition_slice::option)8, (query::partition_slice::option)9, (query::partition_slice::option)10, (query::partition_slice::option)11, (query::partition_slice::option)12, (query::partition_slice::option)13> >) const at table.cc:?
replica::table::push_view_replica_updates(seastar::shared_ptr<db::view::view_update_generator>, seastar::lw_shared_ptr<schema const> const&, mutation&&, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >, tracing::trace_state_ptr, reader_concurrency_semaphore&) const at table.cc:?
replica::table::push_view_replica_updates(seastar::shared_ptr<db::view::view_update_generator>, seastar::lw_shared_ptr<schema const> const&, frozen_mutation const&, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >, tracing::trace_state_ptr, reader_concurrency_semaphore&) const at table.cc:?
replica::database::do_apply(seastar::lw_shared_ptr<schema const>, frozen_mutation const&, tracing::trace_state_ptr, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >, seastar::bool_class<db::force_sync_tag>, std::variant<std::monostate, db::per_partition_rate_limit::account_only, db::per_partition_rate_limit::account_and_enforce>) at database.cc:?

It doesn't really help me understand, but now that I can fairly consistently (albeit slowly) reproduce this problem, I can add more printouts.

nyh commented Mar 3, 2024

Hi @denesb, I'm zeroing in on this problem. It appears that we're reading mutation fragments output by make_flat_mutation_reader_from_mutations_v2(std::move(m_schema), std::move(permit), std::move(m)) fragment by fragment, and one of the fragments is a clustering row (is_clustering_row()), but its as_clustering_row()->empty() returns true - which the MV code asserts against. Before I go deeper down this rabbit hole, I want to ask you if you can think of a reason why a mutation reader can validly generate an "empty" clustering row (which means it doesn't even have a tombstone!) - perhaps a race with a schema change can explain this (how?).

I can modify the MV code to silently ignore the empty clustering row instead of throwing and breaking the test, but I want to understand what I'm doing. More importantly, I suspect there's a deeper bug here, perhaps in the mutation reader or something, that can bite us in other places as well, where this empty clustering row can be returned and perhaps confuse other code too (?).

nyh commented Mar 3, 2024

table::generate_and_propagate_view_updates() gets a mutation m which is:

{table: 'lwt_load_ks.table1', key: {'pk': 1, token: -4069959284402364209}, mutation_partition: {
  rows: [
    {
      cont: true,
      dummy: false,
      position: {
        bound_weight: 0,
        'ck': 1,
      },
    },
  ]
}
}

So it is already an empty row in the partition - no content or tombstone. I have no idea how it happened, but it's not a problem of the flat mutation reader, which correctly returns an empty row. I have no idea what "cont: true" means :-( I'm continuing to investigate.

nyh commented Mar 3, 2024

As I suspected, the schema change is involved in this mess. The functions database::do_apply() and table::do_push_view_replica_updates() (see the backtrace above for which functions were involved) get the following base-table mutation m:

{table: 'lwt_load_ks.table1', key: {'pk': 1, token: -4069959284402364209}, mutation_partition: {
  rows: [
    {
      cont: true,
      dummy: false,
      position: {
        bound_weight: 0,
        'ck': 1,
      },
      'int_col': {atomic_cell{1,ts=1709482469950458,expiry=-1,ttl=0}},
    },
  ]
}
}

Then table::do_push_view_replica_updates() asks to "upgrade" this received mutation to the latest schema:

schema_ptr base = schema();
m.upgrade(base);

And after this upgrade, because the new schema had int_col deleted, an empty row is left in m:

{table: 'lwt_load_ks.table1', key: {'pk': 1, token: -4069959284402364209}, mutation_partition: {
  rows: [
    {
      cont: true,
      dummy: false,
      position: {
        bound_weight: 0,
        'ck': 1,
      },
    },
  ]
}
}

This brings up a few questions I'll need to answer:

  1. Why does this workload, which I was sure only sent deletions, have a mutation that seems to set int_col instead of deleting it?
  2. If we consider empty rows to be bad, upgrade() should remove rows which remain empty (if after the upgrade nothing is left)
  3. Alternatively, we can say that empty rows are NOT bad, they can happen because of upgrade() and the MV code should silently ignore them, not complain.
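
Questions 2 and 3 in toy form (illustrative Python, not the actual C++ mutation::upgrade()): upgrade() drops the cells of columns the new schema no longer has, but it does not remove the row that is left empty.

    # Toy model of upgrading a mutation row to a newer schema.
    def upgrade_row(cells, new_schema_columns):
        # Drop cells for columns the new schema no longer has; note that
        # the (possibly now-empty) row itself is NOT removed.
        return {col: v for col, v in cells.items() if col in new_schema_columns}

    incoming = {'int_col': 1}   # a mutation that only sets int_col
    latest_columns = {'v'}      # after "ALTER TABLE table1 DROP int_col"

    print(upgrade_row(incoming, latest_columns))
    # {} -- an empty row, with no cells and no tombstone, like the m above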

nyh commented Mar 3, 2024

Beyond questions 2 and 3 above, which I need to solve (empty rows should have been allowed if an upgrade() creates them), I'm also completely puzzled how the original m shown in the above comment - which appears to be something like an UPDATE setting "int_col = 1 WHERE pk = 1 AND ck = 1" - came to exist in the first place.

The failure in the test clearly happens when the test is sending a DELETE request, "DELETE FROM table1 WHERE pk = %i AND ck = %i %s", via a prepared statement. This would generate a very different mutation from what we see here!

The only place we did similar mutations was the prepared INSERT statements in _case_prologue() that inserted pk, ck, v and int_col with the same value - but that's different: even if we had a confusion in the prepared statement, such an insert would have also inserted v and a CQL row marker, which our weird mutation didn't have. The weird mutation looks like it only modified int_col and nothing else - like an UPDATE. But I couldn't find such an operation in this test.

nyh commented Mar 3, 2024

I verified that the initial write of the data in the test looked like this:

{table: 'lwt_load_ks.table1', key: {'pk': 1, token: -4069959284402364209}, mutation_partition: {
  rows: [
    {
      cont: true,
      dummy: false,
      marker: {row_marker: 1709486692254919 0 0},
      position: {
        bound_weight: 0,
        'ck': 1,
      },
      'int_col': {atomic_cell{1,ts=1709486692254919,expiry=-1,ttl=0}},
      'v': {atomic_cell{1,ts=1709486692254919,expiry=-1,ttl=0}},
    },
  ]
}
}

where we can clearly see that this is an INSERT (there's a row marker) that also sets v, not just int_col.

So the mutation that sets only int_col is clearly something different.

I wonder if that weird update with int_col=1 might have been some sort of read repair, where the two nodes noticed (perhaps due to the LWT operation that had to read the row?) that one of them had int_col and the other didn't, and sent a mutation to add the missing int_col=1. Does read repair do that? I'm not at all sure... @gleb-cloudius, maybe you remember whether read repair writes an entire row or just the differing columns?

If that's really the correct explanation, then the whole business of these extra int_col=1 mutations is benign, and there is no need to fix it - we just need MV to stop thinking that an empty row is bad (#15228), and then this issue can be closed.

nyh commented Mar 3, 2024

I looked at the read repair code (which I haven't reviewed in many years), and as far as I can tell future<std::optional<reconcilable_result>> resolve() (storage_proxy.cc) has code like

               auto diff = v.par
                          ? m.partition().difference(*schema, (co_await v.par->mut().unfreeze_gently(schema)).partition())
                          : mutation_partition(*schema, m.partition());

and as far as I can tell, the difference() function recursively goes into partitions, row collections and so on and finds the differences, so read repair can - if my understanding is correct - generate this mysterious "int_col=1" mutation.
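
If that reading is right, the repair mutation contains only the differing cells. A toy model of that column-level reconciliation (illustrative Python, not the actual mutation_partition::difference()):

    # Toy model: send a replica only the cells it is missing, or has older
    # versions of, relative to the reconciled read; cells are (value, ts).
    def difference(reconciled, replica_row):
        return {col: cell for col, cell in reconciled.items()
                if col not in replica_row or replica_row[col][1] < cell[1]}

    reconciled = {'v': (1, 1000), 'int_col': (1, 2000)}  # the LWT read result
    replica    = {'v': (1, 1000)}                        # replica missing int_col

    print(difference(reconciled, replica))
    # {'int_col': (1, 2000)} -- a mutation setting *only* int_col, matching
    # the "mysterious" mutation seen above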

nyh added a commit to nyh/scylla that referenced this issue Mar 3, 2024
This one-line patch fixes a failure in the dtest

        lwt_schema_modification_test.py::TestLWTSchemaModification
        ::test_table_alter_delete

where an update sometimes failed due to an internal server error, and the
log had the mysterious warning message:

        "std::logic_error (Empty materialized view updated)"

We've also seen this log-message in the past in another user's log, and
never understood what it meant.

It turns out that the error message was generated (and warning printed)
while building view updates for a base-table mutation, and noticing that
the base mutation contains an *empty* row - a row with no cells or
tombstone or anything whatsoever. This case was deemed (8 years ago,
in d5a61a8) unexpected and nonsensical,
and we threw an exception. But this case actually *can* happen - here is
how it happened in test_table_alter_delete - which is a test involving
a strange combination of materialized views, LWT and schema changes:

 1. A table has a materialized view, and also a regular column "int_col".
 2. A background thread repeatedly drops and re-creates this column
    int_col.
 3. Another thread deletes rows with LWT ("IF EXISTS").
 4. These LWT operations each reads the existing row, and because of
    repeated drop-and-recreate of the "int_col" column, sometimes this
    read notices that one node has a value for int_col and the other
    doesn't, and creates a read-repair mutation setting int_col (the
    difference between the two reads includes just in this column).
 5. The node missing "int_col" receives this mutation which sets only
    int_col. It upgrade()s this mutation to its most recent schema,
    which doesn't have int_col, so it removes this column from the
    mutation row - and is left with a completely empty mutation row.
    This completely empty row is not useful, but upgrade() doesn't
    remove it.
 6. The view-update generation code sees this empty base-mutation row
    and fails it with this std::logic_error.
 7. The node which sent the read-repair mutation sees that the read
    repair failed, so it fails the read and therefore fails the LWT
    delete operation.
    It is this LWT operation which failed in the test, and caused
    the whole test to fail.

The fix is trivial: an empty base-table row mutation should simply be
*ignored* when generating view updates - it shouldn't cause any error.

Before this patch, test_table_alter_delete used to fail in roughly
20% of the runs on my laptop. After this patch, I ran it 100 times
without a single failure.

Fixes scylladb#15228
Fixes scylladb#17549

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
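
In toy form (illustrative Python, mirroring the models in the earlier comments; the real one-line change is in Scylla's C++ view-update builder), the fix amounts to skipping empty base rows instead of raising:

    # Toy version of the fix: an empty base row (no cells, no tombstone)
    # contributes nothing to view updates instead of being a logic error.
    from dataclasses import dataclass, field

    @dataclass
    class Row:
        cells: dict = field(default_factory=dict)
        tombstone: int | None = None

    def generate_view_updates(base_rows):
        updates = []
        for row in base_rows:
            if not row.cells and row.tombstone is None:
                continue  # the fix: ignore, don't raise "Empty materialized view updated"
            updates.append(row)  # (the real code builds per-view mutations here)
        return updates

    rows = [Row(cells={'int_col': 1}), Row(), Row(tombstone=2000)]
    print(len(generate_view_updates(rows)))  # 2 -- the empty row is ignored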
mykaul removed the triage/master (Looking for assignee) label Mar 6, 2024

tchaikov commented Mar 7, 2024

nyh commented Mar 7, 2024

and https://jenkins.scylladb.com/job/scylla-master/job/gating-dtest-release-with-consistent-topology-changes/188

I don't understand how this test still fails, given that @fruch reverted the introduction of that test (in dtest commit 3815457f2bd2c40f1205ed9af8be91f7a589e373).

fruch commented Mar 7, 2024

and https://jenkins.scylladb.com/job/scylla-master/job/gating-dtest-release-with-consistent-topology-changes/188

I don't understand how this test still fails, given that @fruch reverted the introduction of that test (in dtest commit 3815457f2bd2c40f1205ed9af8be91f7a589e373).

those links are 8 days old...

tchaikov commented Mar 7, 2024

Yeah, these failure reports were created before the test was removed from gating. I was just annotating the failures in my PRs to ensure that all of them are tracked.

dgarcia360 pushed a commit to dgarcia360/scylla that referenced this issue Apr 30, 2024