Updates with TTL fail to replicate in a mixed cluster of 3.0 and 3.1 #4855

Closed
dyasny opened this issue Aug 15, 2019 · 25 comments · Fixed by #5214

@dyasny dyasny commented Aug 15, 2019

Reported by @ultrabug via ticket #793

Installation details
Scylla version (or git commit hash):
3.0.9

Platform (physical/VM/cloud instance type/docker): Gentoo

@ultrabug built current master of scylla and tried upgrading a node of his current 3.0.9 staging cluster with it.

It failed with the following error on the node running master:

Aug 14 14:44:07 scy-st-1-p1 scylla: [shard 17] storage_proxy - Failed to apply mutation from 10.6.18.12#1: std::out_of_range (deserialization buffer underflow)

and with this error on the other nodes:

Aug 14 14:43:58 scy-st-1-p2 scylla: [shard 1] storage_proxy - Failed to apply mutation from 10.6.18.11#24: std::runtime_error (Truncated frame)

Logs available in ticket #793

@ultrabug ultrabug commented Aug 16, 2019

Thanks for getting the discussion here

@dyasny just my 2 cents, those autolinks to support ticket using pound are misleading because they point to a GH issue

@dyasny dyasny commented Aug 16, 2019

agreed, I just didn't want to post enterprise ticket links in a public issue

@tzach tzach commented Aug 26, 2019

@avikivity, please advise

@slivne slivne commented Sep 8, 2019

we do not allow/test upgrade from 3.0 --> master (3.2), we still need to clear this out

@slivne slivne added this to the 3.2 milestone Sep 8, 2019

@avikivity avikivity commented Sep 8, 2019

There's nothing in principle that should prevent this, so it's a bug. The question is whether it's a 3.0->3.1 regression or a 3.1->master regression.

@slivne slivne commented Sep 8, 2019

@avikivity avikivity commented Sep 8, 2019

We should reproduce with a similar schema (3.0->master), then determine where the problem is.

@slivne slivne commented Oct 6, 2019

I have tried to reproduce this by downsizing the schema to include all the types that they have ...

numberly_all_types.cql.txt

I ran a cluster with 3.0.9 and upgraded a single node to 3.1.0.rc8 and did writes in both old and new nodes

I then upgraded the node from 3.1.0.rc8 to master 3b9bf9d (so the cluster had one node on 3.0.9 and another on 3b9bf9d), did writes on both nodes, and it worked

I also extended the writes to include multiple elements

insert into ks.all_types (timestamp_k, text_k, boolean_k , bigint_k , date_k , smallint_k , blob_k , double_k , float_k , ascii_k , uuid_k , timeuuid_k , int_k , 
                          timestamp_c , text_c , boolean_c , bigint_c , date_c , smallint_c , blob_c , double_c , float_c , ascii_c , uuid_c , timeuuid_c , int_c, 
                          timestamp_v , text_v , boolean_v , bigint_v , date_v , smallint_v , blob_v , double_v , float_v , ascii_v , uuid_v , timeuuid_v , int_v,
                          list_text_v, list_bigint_v, frozen_list_double_v, frozen_list_bigint_v, frozen_list_text, frozen_set_text, set_text, frozen_map_text
) values ('2019-01-01','0',true,0,'2019-01-01',0,0x00,0.0,0.0,'0',now(),now(),0,
          '2019-01-01','0',true,0,'2019-01-01',0,0x00,0.0,0.0,'0',now(),now(),0,
          '2019-01-01','0',true,0,'2019-01-01',0,0x00,0.0,0.0,'0',now(),now(),0,
          ['0','1'],[0,1],[0.0, 1.1],[0, 1],['0', '1'],{'0', '1'},{'0', '1'},{'0':'0', '1':'1'});

I am not able to reproduce this (in a simple manner).

to be clear I did not create a schema with all the customer permutations but tried an inclusive schema of everything

@tarzanek tarzanek commented Oct 18, 2019

so it seems their problem is between 3.0.10 to 3.1

@tgrabiec tgrabiec commented Oct 18, 2019

Caused by 93270dd, which changed the representation of gc_clock::duration from 32-bit to 64-bit.

The IDL references it directly (which is also a weakness of our IDL):

class expiring_marker stub [[writable]] {
    live_marker lm;
    gc_clock::duration ttl;
    gc_clock::time_point expiry;
};
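The width mismatch described above can be illustrated with a toy model. This is a minimal sketch, not Scylla's serializer: the record layout, the function names, and the assumption that only the ttl field changes width are all illustrative.

```python
import struct

def encode_marker(ttl_seconds, expiry_seconds, ttl_format):
    """Pack a toy (ttl, expiry) record.

    ttl_format is 'i' for a 32-bit ttl or 'q' for a 64-bit ttl.
    The 32-bit expiry field is an assumption made for this sketch.
    """
    return struct.pack("<" + ttl_format + "i", ttl_seconds, expiry_seconds)

def decode_marker(buf, ttl_format):
    """Unpack the toy record, rejecting buffers of the wrong size."""
    layout = struct.Struct("<" + ttl_format + "i")
    if len(buf) < layout.size:
        raise ValueError("buffer underflow: need %d bytes, got %d"
                         % (layout.size, len(buf)))
    if len(buf) > layout.size:
        raise ValueError("%d unread bytes left in frame"
                         % (len(buf) - layout.size))
    return layout.unpack(buf)

def try_decode(buf, ttl_format):
    """Return the decoded tuple, or the error text if decoding fails."""
    try:
        return decode_marker(buf, ttl_format)
    except ValueError as exc:
        return str(exc)

old_frame = encode_marker(3600, 1571400000, "i")  # written with a 32-bit ttl
new_frame = encode_marker(3600, 1571400000, "q")  # written with a 64-bit ttl

print(try_decode(old_frame, "q"))  # 64-bit reader, 32-bit writer
print(try_decode(new_frame, "i"))  # 32-bit reader, 64-bit writer
```

In this model a reader expecting 64 bits runs out of bytes on an old frame, and a reader expecting 32 bits finds leftover bytes in a new frame, so writes fail in both directions of a mixed cluster.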
@tgrabiec tgrabiec changed the title Fail to upgrade from 3.0.9 to current master (LIKE feature testing) Updates with TTL fail to replicate in a mixed cluster of 3.0 and 3.1 Oct 18, 2019

@ultrabug ultrabug commented Oct 18, 2019

@tgrabiec this indeed "solves" the problem on my 3.1.0 node that now does not spout any error message any more

BUT one of my 3.0.10 nodes spams logs with

Oct 18 16:41:45 scy-st-1-p2 scylla:  [shard 27] storage_proxy - Failed to apply mutation from 10.6.18.11#8: std::runtime_error (Truncated frame)

@ultrabug ultrabug commented Oct 18, 2019

Please note that those logs are not present on the 3rd node running 3.0.10...

@tgrabiec tgrabiec commented Oct 18, 2019

@ultrabug Does it still happen if you wipe-out hinted handoff and commitlog directories on 3.1 (with the server stopped while doing it)?

@ultrabug ultrabug commented Oct 18, 2019

@tgrabiec indeed the logs do not come up after dropping commitlog & hints directories!

@ultrabug ultrabug commented Oct 18, 2019

So I will keep the 3.1.0 node up & running until tomorrow and monitor how it's going.

So far so good...

If that indeed fixes the case, I'd like to kindly ask for a quick 3.1.1 release please.

@ultrabug ultrabug commented Oct 19, 2019

Following-up:

  • metrics looking good
  • no errors in logs
  • cqlsh working good
  • repair run from scylla-manager

Once the repair is done, I'll report back.

Then I'll close support tickets and maintain the patch on my side to deploy on the rest of the cluster in the hope of a soon 3.1.1 (friendly hint for #5190).

@ultrabug ultrabug commented Oct 21, 2019

^ repair works (not completed due to scylla-manager being too slow) so proceeding with upgrade on the other nodes

@ultrabug ultrabug commented Oct 21, 2019

I reported another issue where a server which is running a previous version won't restart if the majority of the cluster is running 3.1.0 (logs provided).

Scenario:

  • 2 nodes running 3.1.0
  • 1 node running 3.0.10

The node running 3.0.10 will not restart and stays stuck after this log line:

[shard 3] compaction_manager - compaction failed: exceptions::mutation_write_failure_exception (Operation failed for system.compaction_history - received 0 responses and 1 failures from 1 CL=ONE.)

The failing node would only restart after being upgraded to 3.1.0

@tgrabiec tgrabiec commented Oct 21, 2019

@ultrabug ultrabug commented Oct 21, 2019

@tgrabiec did try restart 3 times, every time I had to kill -9 scylla since it didn't want to stop properly

I now have upgraded the node, couldn't wait any longer

@tgrabiec tgrabiec commented Oct 23, 2019

@ultrabug The issue with restarts doesn't seem related to this bug.

Looks like you hit #4458, which caused your old node to auto-stop immediately after booting. compaction failure error was probably due to database stopping concurrently with compaction and rejecting local writes to system.compaction_history (a bug, but benign). The fact that auto-stop is hanging is another bug. To debug that we'd need to have a core dump or a reproducer.

@ultrabug ultrabug commented Oct 23, 2019

rgr @tgrabiec thanks

hope of a soon 3.1.1
and
friendly hint for #5190

avikivity added a commit to avikivity/scylla that referenced this issue Oct 24, 2019
Commit 93270dd changed gc_clock to be 64-bit, to fix the Y2038
problem. While 64-bit tombstone::deletion_time is serialized in a
compatible way, TTLs (gc_clock::duration) were not.

This patchset reverts TTL serialization to the 32-bit serialization
format, and also allows opting-in to the 64-bit format in case a
cluster was installed with the broken code. Only Scylla 3.1.0 is
vulnerable.

Fixes scylladb#4855

Tests: unit (dev)
(cherry picked from commit e621db5)
avikivity added a commit that referenced this issue Oct 24, 2019
Commit 93270dd changed gc_clock to be 64-bit, to fix the Y2038
problem. While 64-bit tombstone::deletion_time is serialized in a
compatible way, TTLs (gc_clock::duration) were not.

This patchset reverts TTL serialization to the 32-bit serialization
format, and also allows opting-in to the 64-bit format in case a
cluster was installed with the broken code. Only Scylla 3.1.0 is
vulnerable.

Fixes #4855

Tests: unit (dev)
(cherry picked from commit e621db5)
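The approach described in the commit message (32-bit TTL on the wire by default, 64-bit only when explicitly opted in) can be sketched as follows. This is a hedged illustration, not the actual patch: the function names and the `use_64bit` parameter are hypothetical, and the real opt-in mechanism is not shown in this thread.

```python
import struct

INT32_MAX = 2**31 - 1

def serialize_ttl(ttl_seconds, use_64bit=False):
    """Encode a TTL in the 32-bit wire format unless 64-bit is opted in.

    use_64bit is a hypothetical stand-in for the opt-in mentioned in the
    commit message, intended for clusters that already wrote 64-bit TTLs.
    """
    if use_64bit:
        return struct.pack("<q", ttl_seconds)
    if not 0 <= ttl_seconds <= INT32_MAX:
        raise OverflowError("ttl does not fit the 32-bit wire format")
    return struct.pack("<i", ttl_seconds)

def deserialize_ttl(buf, use_64bit=False):
    """Decode a TTL; struct.error is raised if the buffer size is wrong."""
    fmt = "<q" if use_64bit else "<i"
    (ttl_seconds,) = struct.unpack(fmt, buf)
    return ttl_seconds

wire = serialize_ttl(86400)           # default: 4 bytes, readable by 3.0 nodes
print(len(wire), deserialize_ttl(wire))
```

Keeping the in-memory type wide while narrowing only at the serialization boundary preserves the Y2038 fix internally without changing what older nodes see on the wire.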

@amoskong amoskong commented Oct 25, 2019

@amoskong amoskong commented Oct 25, 2019

We already enhanced our upgrade test by scylladb/scylla-cluster-tests#1411, then this issue can be easily reproduced by http://jenkins.scylladb.com/job/scylla-staging/job/amos/job/rolling-upgrade-clone-3.1/job/rolling-upgrade-centos7/11/

I just submitted two jobs to verify the issue is really fixed.

3.1.0-0.20191024.3f4d9f210 failed as expected.

storage_proxy - Failed to apply mutation from 10.142.0.79#1: std::runtime_error (Truncated frame)
Error decoding response from Cassandra. ver(3); flags(0000); stream(293); op(0); offset(9); len(133); buffer: '\x83\x00\x01%\x00\x00\x00\x00\x85\x00\x00\x15\x00\x00mOperation failed for keyspace1.timestamp_and_ttl_test - received 1 responses and 2 failures from 2 CL=QUORUM.\x00\x04\x00\x00\x00\x01\x00\x00\x00\x02\x00\x06SIMPLE'

> Traceback (most recent call last):
>   File "cassandra/connection.py", line 609, in cassandra.connection.Connection.process_msg
>     response = decoder(header.version, self.user_type_map, stream_id,
>   File "cassandra/protocol.py", line 1149, in cassandra.protocol._ProtocolHandler.decode_message
>     msg = msg_class.recv_body(body, protocol_version, user_type_map, result_metadata)
>   File "cassandra/protocol.py", line 133, in cassandra.protocol.ErrorMessage.recv_body
>     extra_info = subcls.recv_error_info(f, protocol_version)
>   File "cassandra/protocol.py", line 319, in cassandra.protocol.WriteFailureMessage.recv_error_info
>     write_type = WriteType.name_to_value[read_string(f)]
> KeyError: u'LE'

I didn't see the TTL issue, but the truncate in boolean_test failed when all the nodes were upgraded to 3.1.

ProtocolException: <Error from server: code=000a [Protocol error] message="Error during truncate: std::runtime_error (Can't find a column family boolean_test in keyspace keyspace1)">

I will try to reproduce both automatically & manually, and report a new issue.
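One possible reading of the KeyError: u'LE' in the traceback above, offered as an assumption rather than a confirmed diagnosis: if the client decoder expects a 4-byte failure count before the write-type string and the frame does not carry one, the decoder consumes the first bytes of the string instead. The sketch below replays that on the tail of the buffer from the log. The helper functions are simplified stand-ins written for this example, not the driver's code.

```python
import io
import struct

# Tail of the error frame from the log, starting right after the message text.
tail = b"\x00\x04\x00\x00\x00\x01\x00\x00\x00\x02\x00\x06SIMPLE"

def read_string(f):
    """Read a 2-byte big-endian length, then up to that many bytes of text."""
    (size,) = struct.unpack(">H", f.read(2))
    return f.read(size).decode("utf8")

def decode_write_type(buf, expect_failure_count):
    """Decode consistency, received and required, then the write type.

    expect_failure_count models a decoder that assumes a 4-byte failure
    count precedes the write-type string (an assumption of this sketch).
    """
    f = io.BytesIO(buf)
    consistency, received, required = struct.unpack(">Hii", f.read(10))
    if expect_failure_count:
        f.read(4)  # swallows b"\x00\x06SI" when the field is absent
    return read_string(f)

print(decode_write_type(tail, expect_failure_count=False))
print(decode_write_type(tail, expect_failure_count=True))
```

Under that assumption the bytes "MP" are read as a string length and the only bytes left to return are "LE", which matches the key in the traceback.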

@avikivity avikivity commented Dec 23, 2019

Backported everywhere needed.
