
storage proxy prints exception errors with bad ip addresses #3229

Closed
slivne opened this issue Feb 22, 2018 · 39 comments

@slivne (Contributor) commented Feb 22, 2018

Installation details
Scylla version (or git commit hash): 2.0.2

split from #3143

More information (reposted from Slack). In both cases discussed above we found the following strange errors in syslog, right before the process died:

Jan 24 20:30:43 ip-172-30-0-96 scylla: [shard 1] storage_proxy - exception during mutation write to 0.0.96.16: std::system_error (error system:22, Invalid argument)
Jan 24 20:30:43 ip-172-30-0-96 scylla: [shard 1] storage_proxy - exception during mutation write to 0.0.96.16: std::system_error (error system:22, Invalid argument)
Jan 24 20:30:43 ip-172-30-0-96 scylla: [shard 2] storage_proxy - exception during mutation write to 0.0.96.32: std::system_error (error system:22, Invalid argument)
Jan 24 20:30:43 ip-172-30-0-96 scylla: [shard 2] storage_proxy - exception during mutation write to 0.0.96.32: std::system_error (error system:22, Invalid argument)

In these error messages, "0.0.96.16" is an IP address; other log lines showed other random, completely unreal IPs:

Jan 24 21:41:48 ip-10-1-6-176 scylla: [shard 5] rpc - client 9.209.94.248: fail to connect: Connection timed out
Jan 24 21:41:48 ip-10-1-6-176 scylla: [shard 3] rpc - client 10.82.210.56: fail to connect: Connection timed out
Jan 24 21:41:50 ip-10-1-6-176 scylla: [shard 2] rpc - client 8.249.67.128: fail to connect: Connection timed out
Jan 24 21:41:50 ip-10-1-6-176 scylla: [shard 2] rpc - client 8.249.76.240: fail to connect: Connection timed out
...
Jan 24 21:41:57 ip-10-1-6-176 scylla: [shard 2] rpc - client 8.249.76.64: fail to connect: Connection timed out
Jan 24 21:41:59 ip-10-1-6-176 scylla: [shard 0] rpc - client 9.59.43.240: fail to connect: Connection timed out
Jan 24 21:42:13 ip-10-1-6-176 scylla: [shard 6] rpc - client 1.48.229.0: fail to connect: Connection timed out

This could be some sort of memory corruption or an addressing issue.

@slivne (Contributor, Author) commented Feb 22, 2018

@asias / @gleb-cloudius - can you please check the code around these printouts?

@slivne slivne added this to the 2.2 milestone Feb 22, 2018

@gleb-cloudius (Contributor) commented Feb 22, 2018

@slivne (Contributor, Author) commented May 13, 2018

Leaving this open and pushing it out - we do not have a lead at this point. We will try to run debug builds and dtests.

@slivne slivne modified the milestones: 2.2, 2.3 May 13, 2018

@anilyadav commented Jul 10, 2018

We are facing this issue on a 2.1.6 cluster. We are getting the same error.

Jul 10 11:58:21 XXX scylla[3696]:  [shard 1] storage_proxy - exception during mutation write to 0.0.96.16: std::system_error (error system:22, Invalid argument)
Jul 10 11:58:23 XXX scylla[3696]:  [shard 1] storage_proxy - exception during mutation write to 0.62.110.0: std::system_error (error system:22, Invalid argument)
Jul 10 11:58:23 XXX scylla[3696]:  [shard 1] storage_proxy - exception during mutation write to 0.62.110.0: std::system_error (error system:22, Invalid argument)

@anilyadav commented Jul 10, 2018

It seems totally weird. We upgraded a 6-node ScyllaDB cluster to 2.1.6 (from version 2.0.4). 4 nodes are working fine; 2 nodes are giving the above error. We tried adding new 2.1.6 nodes, still the same issue.

@gleb-cloudius (Contributor) commented Jul 10, 2018

@anilyadav commented Jul 10, 2018

@gleb-cloudius nodetool status says the nodes are working, while the application is facing the issue and we can see the above error in the error log.

It's showing all nodes up everywhere.

@gleb-cloudius (Contributor) commented Jul 10, 2018

@anilyadav commented Jul 10, 2018

We can also see the random IPs below in the error log:

Jul  9 18:14:16 scyllanode1 scylla: [shard 0] rpc - client 1.232.131.163: fail to connect: Connection timed out
Jul  9 19:20:44 scyllanode1 scylla: [shard 2] rpc - client 2.12.0.195: fail to connect: Connection timed out
Jul  9 19:32:04 scyllanode1 scylla: [shard 2] rpc - client 64.0.17.14: fail to connect: Connection timed out
Jul  9 19:36:03 scyllanode1 scylla: [shard 2] rpc - client 2.12.0.195: fail to connect: Connection timed out
Jul  9 23:43:25 scyllanode1 scylla: [shard 1] rpc - client 146.72.133.176: fail to connect: Connection timed out
Jul 10 00:58:02 scyllanode1 scylla: [shard 1] rpc - client 1.232.131.219: fail to connect: Connection timed out
Jul 10 00:58:27 scyllanode1 scylla: [shard 1] rpc - client 13.38.46.144: fail to connect: Connection timed out
Jul 10 01:25:43 scyllanode1 scylla: [shard 1] rpc - client 220.240.148.160: fail to connect: Connection timed out
Jul 10 01:30:44 scyllanode1 scylla: [shard 1] rpc - client 1.232.131.224: fail to connect: Connection timed out
Jul 10 01:31:46 scyllanode1 scylla: [shard 1] rpc - client 180.81.115.176: fail to connect: Connection timed out
Jul 10 04:01:30 scyllanode1 scylla: [shard 0] rpc - client 1.232.131.245: fail to connect: Connection timed out
Jul 10 06:50:58 scyllanode1 scylla: [shard 0] rpc - client 253.54.205.48: fail to connect: Connection timed out
Jul 10 07:10:54 scyllanode1 scylla: [shard 1] rpc - client 1.189.27.208: fail to connect: Connection timed out
Jul 10 07:24:03 scyllanode1 scylla: [shard 1] rpc - client 1.189.27.208: fail to connect: Connection timed out
Jul 10 07:26:59 scyllanode1 scylla: [shard 1] rpc - client 1.189.27.208: fail to connect: Connection timed out
Jul 10 07:30:36 scyllanode1 scylla: [shard 1] rpc - client 1.189.27.208: fail to connect: Connection timed out
Jul 10 07:48:11 scyllanode1 scylla: [shard 1] rpc - client 1.189.27.208: fail to connect: Connection timed out
Jul 10 08:08:46 scyllanode1 scylla: [shard 1] rpc - client 1.189.27.208: fail to connect: Connection timed out
Jul 10 08:18:17 scyllanode1 scylla: [shard 1] rpc - client 1.189.27.208: fail to connect: Connection timed out
Jul 10 08:31:24 scyllanode1 scylla: [shard 1] rpc - client 1.189.27.208: fail to connect: Connection timed out
Jul 10 08:41:42 scyllanode1 scylla: [shard 1] rpc - client 1.189.27.208: fail to connect: Connection timed out
Jul 10 08:44:08 scyllanode1 scylla: [shard 1] rpc - client 1.189.27.208: fail to connect: Connection timed out
Jul 10 10:03:27 scyllanode1 scylla: [shard 2] rpc - client 32.144.0.0: fail to connect: Connection timed out

The IPs are correct, and the snitch is Ec2Snitch.

@gleb-cloudius (Contributor) commented Jul 10, 2018

@anilyadav commented Jul 10, 2018

We have only 1 DC (AWS), but we are using a multi-zone setup. In the output of SELECT peer,preferred_ip,rpc_address FROM system.peers;, the IPs are the same ones we are using in the cluster.

@anilyadav commented Jul 10, 2018

@gleb-cloudius We created two rings in the same DC.

@gleb-cloudius (Contributor) commented Jul 10, 2018

@gleb-cloudius (Contributor) commented Jul 10, 2018

@anilyadav commented Jul 10, 2018

Yes, the IPs are the same. And preferred_ip is null.

@anilyadav commented Jul 10, 2018

We are running Scylla in AWS and are using AWS availability zones.

@gleb-cloudius Can we connect over Slack?

@gleb-cloudius (Contributor) commented Jul 12, 2018

@slivne (Contributor, Author) commented Aug 19, 2018

@roydahan can you please get someone to help test this?

@avikivity (Contributor) commented Aug 20, 2018

@gleb-cloudius maybe there is a race between Scylla starting up and the snitch filling in its state. Do we wait for the snitch to complete its start-up?

@gleb-cloudius (Contributor) commented Aug 20, 2018

@avikivity (Contributor) commented Aug 20, 2018

Maybe an error happened and the snitch never got initialized.

@gleb-cloudius (Contributor) commented Aug 20, 2018

@tzach tzach modified the milestones: 2.3, 2.4 Aug 28, 2018

@slivne slivne modified the milestones: 3.0, 3.1 Sep 16, 2018

@aleclerc commented Feb 11, 2019

Hitting similar issues:

Feb 11 12:42:00 ip-172-26-14-141 scylla:  [shard 2] storage_proxy - exception during mutation write to 0.186.102.0: std::system_error (error system:22, Invalid argument)
Feb 11 12:42:00 ip-172-26-14-141 scylla:  [shard 3] storage_proxy - exception during mutation write to 0.156.196.56: std::system_error (error system:22, Invalid argument)
Feb 11 12:42:00 ip-172-26-14-141 scylla:  [shard 3] storage_proxy - exception during mutation write to 0.156.196.56: std::system_error (error system:22, Invalid argument)
Feb 11 12:42:01 ip-172-26-14-141 scylla:  [shard 0] storage_proxy - exception during mutation write to 0.0.96.0: std::system_error (error system:22, Invalid argument)

Only happening on one of my nodes. Single DC on AWS.

@amoskong (Contributor) commented Feb 12, 2019

@gleb-cloudius (Contributor) commented Feb 12, 2019

@amoskong Over Slack we learned that it is Ec2Snitch and the nodetool status output is:

nodetool status
Datacenter: us-east
===================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens       Owns    Host ID                               Rack
UN  172.26.12.153  9.69 GB    256          ?       0db44185-0316-41c3-a199-e215fcf6f326  1a
UN  172.26.12.220  8.99 GB    256          ?       01408f07-f58c-4837-b99b-73f16f85f06c  1a
UN  172.26.14.141  10.26 GB   256          ?       07fe017f-a382-47a7-aa56-ce59c8724184  1c
UN  172.26.13.231  9.4 GB     256          ?       d9264a66-3049-412e-bfa1-242f91ecbe1c  1b

Three racks + Ec2Snitch again, like in all the other cases, but at this point I think it is just a coincidence.

@amoskong (Contributor) commented Feb 19, 2019

I'm running our short longevity test (3 hours) with random nemesis, but still haven't reproduced the problem (the IP addresses are correct).

Scylla version: 3.0.3-0.20190213.873e0f0e1

Also, the 'rpc - client $ip_address' errors aren't the same as this issue. We should first find a way to reproduce the same problem (storage_proxy - exception during mutation write to xxx; rpc - client xxxxx: fail to connect: Connection timed out).

Two jobs are still running; I will update the final results later.

Test with public IP

[centos@ip-172-31-8-144 ~]$ nodetool status
Datacenter: eu-westscylla_node_west
===================================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens       Owns    Host ID                               Rack
UN  54.194.19.179  3.04 GB    256          ?       fc9e6ffe-cdfb-4d5b-8cf6-09ac4472ac4a  1a
UN  63.33.61.133   2.39 GB    256          ?       a00bb4ef-a3b1-463f-9111-2b71c4e55786  1c
UN  52.215.255.17  2.12 GB    256          ?       05437a62-96d2-442b-8f5b-ecaa2b3bdb8a  1b
UN  34.250.64.0    1.68 GB    256          ?       c9b66ebb-0df1-4d97-88bc-7ff3301dda45  1a
UN  34.247.54.252  2.35 GB    256          ?       b069007c-c771-4410-ba2d-581e68d2ac1f  1b
UN  34.252.231.49  3.12 GB    256          ?       bd85c91a-e144-4b77-9d30-8bb1737a32e8  1c

Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless

---

2019-02-19 09:46:29,100 remote           L0194 DEBUG| [34.247.54.252] [stdout] Feb 19 09:46:29 info   |  [shard 0] rpc - client 34.252.231.49: fail to connect: Connection refused
2019-02-19 10:07:21,267 remote           L0194 DEBUG| [34.252.231.49] [stdout] Feb 19 10:07:21 info   |  [shard 0] rpc - client 34.247.54.252: fail to connect: Connection refused
2019-02-19 10:36:28,575 remote           L0194 DEBUG| [34.252.231.49] [stdout] Feb 19 10:36:28 info   |  [shard 0] rpc - client 52.215.255.17: client connection dropped: read: Connection reset by peer
2019-02-19 10:37:26,403 remote           L0194 DEBUG| [34.247.54.252] [stdout] Feb 19 10:37:26 info   |  [shard 0] rpc - client 52.215.255.17: fail to connect: Connection reset by peer
2019-02-19 09:46:20,829 remote           L0194 DEBUG| [34.252.231.49] [stdout] Feb 19 09:46:20 info   |  [shard 0] rpc - client 63.33.61.133: server connection dropped: Semaphore broken
2019-02-19 09:46:32,586 remote           L0194 DEBUG| [52.215.255.17] [stdout] Feb 19 09:46:32 info   |  [shard 0] rpc - client 34.252.231.49: client connection dropped: connection is closed

Test with private IP

$ nodetool status

Datacenter: eu-westscylla_node_west
===================================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens       Owns    Host ID                               Rack
UN  172.31.13.27   538.09 KB  256          ?       8918e708-3b74-439f-a6f7-c1f88af734ea  1a
UN  172.31.31.121  538.06 KB  256          ?       843db410-992d-4731-90ec-b84e57805a8b  1b
UN  172.31.41.151  537.7 KB   256          ?       4c945990-a382-4f38-b936-8d48813c9961  1c
UN  172.31.15.147  537.75 KB  256          ?       126c8120-9ca5-4958-a6df-379f03e106b1  1a
UN  172.31.34.98   537.72 KB  256          ?       fe80661c-c113-4c76-af66-0fa315380c42  1c
UN  172.31.24.1    515.9 KB   256          ?       3e7732a2-b12d-4ddb-8db5-276fc7e0244d  1b

Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless

2019-02-19 10:41:13,421 remote           L0194 DEBUG| [172.31.24.1] [stdout] Feb 19 10:41:13 info   |  [shard 0] rpc - client 172.31.34.98: client connection dropped: read: Connection reset by peer
2019-02-19 11:04:28,473 remote           L0194 DEBUG| [172.31.24.1] [stdout] Feb 19 11:04:28 info   |  [shard 2] rpc - client 172.31.34.98: server connection dropped: Semaphore broken
2019-02-19 11:04:51,291 remote           L0194 DEBUG| [172.31.34.98] [stdout] Feb 19 11:04:51 info   |  [shard 1] rpc - client 172.31.24.1: client connection dropped: connection is closed

@amoskong (Contributor) commented Feb 19, 2019

In the public-IP test, there is an IP in the rpc client errors that isn't a cluster node: wrong protocol magic | server connection dropped: connection is closed

2019-02-19 12:05:38,736 nemesis
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens       Owns    Host ID                               Rack
UN  63.33.61.133   4.39 GB    256          ?       a00bb4ef-a3b1-463f-9111-2b71c4e55786  1c
UN  52.215.255.17  4.49 GB    256          ?       05437a62-96d2-442b-8f5b-ecaa2b3bdb8a  1b
UN  34.250.64.0    3.92 GB    256          ?       c9b66ebb-0df1-4d97-88bc-7ff3301dda45  1a
UN  34.245.137.78  3.69 GB    256          ?       44a7448d-308a-49a4-a5f9-9aa7a2a73d5f  1a
UN  34.247.54.252  4.07 GB    256          ?       b069007c-c771-4410-ba2d-581e68d2ac1f  1b
UN  34.252.231.49  4.35 GB    256          ?       bd85c91a-e144-4b77-9d30-8bb1737a32e8  1c

2019-02-19 12:07:23,683 remote           L0194 DEBUG| [34.247.54.252] [stdout] Feb 19 12:07:23 info   |  [shard 2] rpc - client 106.75.64.59: wrong protocol magic
2019-02-19 12:07:23,683 remote           L0194 DEBUG| [34.247.54.252] [stdout] Feb 19 12:07:23 info   |  [shard 2] rpc - client 106.75.64.59: server connection dropped: connection is closed
2019-02-19 12:07:24,051 remote           L0194 DEBUG| [34.247.54.252] [stdout] Feb 19 12:07:24 info   |  [shard 2] rpc - client 106.75.64.59: wrong protocol magic
2019-02-19 12:07:24,051 remote           L0194 DEBUG| [34.247.54.252] [stdout] Feb 19 12:07:24 info   |  [shard 2] rpc - client 106.75.64.59: server connection dropped: connection is closed
2019-02-19 12:07:29,236 remote           L0194 DEBUG| [34.247.54.252] [stdout] Feb 19 12:07:29 info   |  [shard 2] rpc - client 106.75.64.59: server connection dropped: connection is closed
2019-02-19 12:07:29,424 remote           L0194 DEBUG| [34.247.54.252] [stdout] Feb 19 12:07:29 info   |  [shard 2] rpc - client 106.75.64.59: wrong protocol magic
2019-02-19 12:07:29,424 remote           L0194 DEBUG| [34.247.54.252] [stdout] Feb 19 12:07:29 info   |  [shard 2] rpc - client 106.75.64.59: server connection dropped: connection is closed
2019-02-19 12:07:29,797 remote           L0194 DEBUG| [34.247.54.252] [stdout] Feb 19 12:07:29 info   |  [shard 2] rpc - client 106.75.64.59: wrong protocol magic
2019-02-19 12:07:29,798 remote           L0194 DEBUG| [34.247.54.252] [stdout] Feb 19 12:07:29 info   |  [shard 2] rpc - client 106.75.64.59: server connection dropped: connection is closed


2019-02-19 12:10:44,773 remote           L0194 DEBUG| [34.250.64.0] [stdout] --  Address        Load       Tokens       Owns    Host ID                               Rack
2019-02-19 12:10:44,791 remote           L0194 DEBUG| [34.250.64.0] [stdout] UN  63.33.61.133   4.22 GB    256          ?       a00bb4ef-a3b1-463f-9111-2b71c4e55786  1c
2019-02-19 12:10:44,802 remote           L0194 DEBUG| [34.250.64.0] [stdout] UN  52.215.255.17  4.82 GB    256          ?       05437a62-96d2-442b-8f5b-ecaa2b3bdb8a  1b
2019-02-19 12:10:44,812 remote           L0194 DEBUG| [34.250.64.0] [stdout] UN  34.250.64.0    3.83 GB    256          ?       c9b66ebb-0df1-4d97-88bc-7ff3301dda45  1a
2019-02-19 12:10:44,823 remote           L0194 DEBUG| [34.250.64.0] [stdout] UN  34.245.137.78  2.67 GB    256          ?       44a7448d-308a-49a4-a5f9-9aa7a2a73d5f  1a
2019-02-19 12:10:44,833 remote           L0194 DEBUG| [34.250.64.0] [stdout] UN  34.247.54.252  3.71 GB    256          ?       b069007c-c771-4410-ba2d-581e68d2ac1f  1b
2019-02-19 12:10:44,843 remote           L0194 DEBUG| [34.250.64.0] [stdout] UN  34.252.231.49  4.25 GB    256          ?       bd85c91a-e144-4b77-9d30-8bb1737a32e8  1c

@gleb-cloudius (Contributor) commented Feb 19, 2019

@amoskong (Contributor) commented Feb 20, 2019

@gleb-cloudius (Contributor) commented May 1, 2019

I finally managed to reproduce it. To reproduce, create a setup with 4 nodes in the same DC, each in its own rack, and run logged batch transactions against the cluster. Reproduction depends on how unordered_multimap divides elements into buckets, but the setup described above worked for me.

penberg added a commit that referenced this issue May 3, 2019

batchlog_manager: fix array out of bound access
endpoint_filter() assumes that each bucket of
std::unordered_multimap contains elements with the same key only, so
its size can be used to know how many elements with a particular key
there are. But this is not the case: elements with different keys may
share a bucket. Fix it by counting keys another way.

Fixes #3229

Message-Id: <20190501133127.GE21208@scylladb.com>
(cherry picked from commit 95c6d19)
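
For readers not familiar with the bucket interface, here is a standalone C++ sketch (not Scylla's code; the rack names and endpoint IPs are invented for illustration) of the std::unordered_multimap property the commit message describes: bucket_size() counts every element that hashed into a bucket, regardless of key, while count() reports how many elements have a particular key.

#include <iostream>
#include <string>
#include <unordered_map>

int main() {
    // Hypothetical rack -> endpoint multimap, similar in shape to what an
    // endpoint filter might build from rack/IP data.
    std::unordered_multimap<std::string, std::string> rack_to_endpoint = {
        {"1a", "172.26.12.153"},
        {"1a", "172.26.12.220"},
        {"1b", "172.26.13.231"},
        {"1c", "172.26.14.141"},
    };

    for (const std::string rack : {"1a", "1b", "1c"}) {
        auto b = rack_to_endpoint.bucket(rack);
        std::cout << rack
                  // Buggy assumption: bucket_size(b) == endpoints in this rack.
                  << ": bucket_size=" << rack_to_endpoint.bucket_size(b)
                  // Correct per-key count: count() looks at keys, not buckets.
                  << ", count=" << rack_to_endpoint.count(rack) << '\n';
    }
    // Whenever two different rack names hash into the same bucket (which is
    // implementation- and input-dependent), bucket_size() over-reports the
    // per-rack count, and an array sized by the true count can then be
    // indexed out of bounds -- the "array out of bound access" the commit fixes.
    return 0;
}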

@amoskong (Contributor) commented May 17, 2019

Hi Gleb,

I still haven't reproduced the issue. Can you help check the test cluster configuration and the two test methods for logged batch transactions?

$ scylla --version
3.0.3-0.20190213.873e0f0e1

$ cat /etc/scylla/scylla.yaml|grep endpoint
endpoint_snitch: "Ec2Snitch"

$ nodetool status

Datacenter: eu-westscylla_node_west
===================================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens       Owns    Host ID                               Rack
UN  172.31.0.111   3.82 GB    256          ?       4b47558d-af16-4a7f-9fa4-597b4b6a7401  1a
UN  172.31.30.199  4 GB       256          ?       ef01aa4f-54de-42c7-a5e3-afffe6fba47a  1b
UN  172.31.15.182  3.86 GB    256          ?       4a918b84-d5ea-4866-b285-9c7c02f5abd8  1a
UN  172.31.40.198  2.05 GB    256          ?       721f048e-ed89-419f-b093-60f247a69905  1c
UN  172.31.37.85   3.72 GB    256          ?       3f3d428b-42a3-4576-b5fb-2efba3648d0f  1c
UN  172.31.21.162  2.5 GB     256          ?       2099a605-be93-4271-81ae-188bf65970e6  1b


Test 1:

cqlsh:keyspace1> CREATE TABLE users (
             ...     id int,
             ...     firstname text,
             ...     lastname text,
             ...     PRIMARY KEY (id)
             ... );
cqlsh:keyspace1> 
cqlsh:keyspace1> BEGIN BATCH USING TIMESTAMP 1111111111111111
             ... INSERT INTO users (id, firstname, lastname) VALUES (0, 'Jack', 'Sparrow')
             ... INSERT INTO users (id, firstname, lastname) VALUES (1, 'Will', 'Turner')
             ... APPLY BATCH;
cqlsh:keyspace1> 
cqlsh:keyspace1> SELECT id, firstname, lastname, writetime(firstname), writetime(lastname) FROM users;

 id | firstname | lastname | writetime(firstname) | writetime(lastname)
----+-----------+----------+----------------------+---------------------
  1 |      Will |   Turner |     1111111111111111 |    1111111111111111
  0 |      Jack |  Sparrow |     1111111111111111 |    1111111111111111

(2 rows)

Test 2:

I tested with a c-s profile workload (insert batchtype is LOGGED); c-s worked well, and I didn't find wrong IPs in the DB log.

$ cassandra-stress user no-warmup profile=/tmp/complex_schema.yaml ops'(insert=1)' cl=ALL n=5000000 -mode cql3 native -rate threads=1000 -pop seq=1..5000000 -node 172.31.0.111
....
  Extra Schema Definitions: [CREATE TYPE IF NOT EXISTS hash_and_basename (md5 text, basename text);, CREATE TYPE IF NOT EXISTS address (street text, city text, zip_code int, phones set<text>, map1 map<text,text>, map2 map<int,text>);]
  Generator Configs:
    key: Size: Fixed:  key=10;
  Query Definitions:
    read1: CQL:select * from user_with_ck where key = ? LIMIT 1;Fields:samerow;
    read2: CQL:select * from user_with_ck where key = ? and md5 = ? LIMIT 1;Fields:samerow;
    update_static: CQL:update user_with_ck USING TTL 5 set static_int = ? where key = ?;Fields:samerow;
    update_ttl: CQL:update user_with_ck USING TTL 5 set check_date = ? where key = ? and md5 = ?;Fields:samerow;
    update_diff1_ts: CQL:update user_with_ck USING TIMESTAMP 10 set check_date = ? where key = ? and md5 = ?;Fields:samerow;
    update_diff2_ts: CQL:update user_with_ck USING TIMESTAMP 5 set pass_date = ? where key = ? and md5 = ?;Fields:samerow;
    update_same1_ts: CQL:update user_with_ck USING TIMESTAMP 10 set check_date = '2018-01-01T11:21:59.001+0000' where key = ? and md5 = ?;Fields:samerow;
    update_same2_ts: CQL:update user_with_ck USING TIMESTAMP 10 set pass_date = '2018-01-01T11:21:59.001+0000' where key = ? and md5 = ?;Fields:samerow;
    alter_table: CQL:ALTER TABLE user_with_ck WITH comment = 'updated';Fields:samerow;
    delete_row: CQL:delete from user_with_ck where key = ? and md5 = ?;Fields:samerow;
  Token Range Queries:
  Insert Settings:
    partitions: fixed(1)
    select: fixed(1)/1000
    batchtype: LOGGED
Connected to cluster: longevity-10gb-3h-3.0-db-cluster-bd4a37fe, max pending requests per connection 128, max connections per host 8
Datatacenter: eu-westscylla_node_west; Host: /172.31.30.199; Rack: 1b
Datatacenter: eu-westscylla_node_west; Host: /172.31.15.182; Rack: 1a
Datatacenter: eu-westscylla_node_west; Host: /172.31.40.198; Rack: 1c
Datatacenter: eu-westscylla_node_west; Host: /172.31.37.85; Rack: 1c
Datatacenter: eu-westscylla_node_west; Host: /172.31.21.162; Rack: 1b
Datatacenter: eu-westscylla_node_west; Host: /172.31.0.111; Rack: 1a
Created schema. Sleeping 1s for propagation.
Created extra schema. Sleeping 1s for propagation.
Generating batches with [1..1] partitions and [0..0] rows (of [1..1] total rows in the partitions)

Sleeping 2s...
Running [insert] with 1000 threads for 5000000 iteration
Failed to connect over JMX; not collecting these stats
type       total ops,    op/s,    pk/s,   row/s,    mean,     med,     .95,     .99,    .999,     max,   time,   stderr, errors,  gc: #,  max ms,  sum ms,  sdv ms,      mb
total,           327,     327,     327,     327,   618.6,   585.6,  1304.4,  1366.3,  1374.7,  1374.7,    2.0, -0.70711,      0,      0,       0,       0,       0,       0
total,          1976,    1649,    1649,    1649,   600.1,   527.4,  1461.7,  1815.1,  2258.6,  2283.8,    3.0, -0.42998,      0,      0,       0,       0,       0,       0

@amoskong (Contributor) commented May 17, 2019

Quoting my Test 2 above:

> I tested with a c-s profile workload (insert batchtype is LOGGED); c-s worked well, and didn't find wrong IPs in the DB log.
>
> $ cassandra-stress user no-warmup profile=/tmp/complex_schema.yaml ops'(insert=1)' cl=ALL n=5000000 -mode cql3 native -rate threads=1000 -pop seq=1..5000000 -node 172.31.0.111
> ....
>
> Insert Settings:
>     partitions: fixed(1)
>     select: fixed(1)/1000
>     batchtype: LOGGED

I successfully reproduced the issue by changing partitions from fixed(1) to fixed(2).

@gleb-cloudius (Contributor) commented May 19, 2019

@amoskong (Contributor) commented May 20, 2019

On Fri, May 17, 2019 at 06:00:20PM +0000, Amos Kong wrote:
> Test 2:
> I tested with a c-s profile workload (insert batchtype is LOGGED); c-s worked well, and didn't find wrong IPs in the DB log.
>
> $ cassandra-stress user no-warmup profile=/tmp/complex_schema.yaml ops'(insert=1)' cl=ALL n=5000000 -mode cql3 native -rate threads=1000 -pop seq=1..5000000 -node 172.31.0.111
> ....
> Insert Settings:
>     partitions: fixed(1)
>
> I successfully reproduced the issue by changing partitions from fixed(1) to fixed(2).

A batch with only one partition will not be treated as a batch.

Thanks.

-- Gleb.

I tried to write a reproducer in dtest. Ec2Snitch can't be used in dtest, so I tried GossipingPropertyFileSnitch with the same workload with which I had reproduced the issue.

But the workload works fine in dtest, and the issue doesn't occur (666.development-0.20190416.52e1125b5, which doesn't contain your fix).

    def test_reproduce_issue3229(self):
        cluster = self.cluster
        cluster.populate([6])
        cluster.set_configuration_options(values={'endpoint_snitch': 'org.apache.cassandra.locator.GossipingPropertyFileSnitch'})

        for i, node in enumerate(cluster.nodelist()):
            with open(os.path.join(node.get_conf_dir(), 'cassandra-rackdc.properties'), 'w') as snitch_file:
                for line in ["dc={}".format(node.data_center), "rack=rack{}".format(i % 3)]:
                    snitch_file.write(line + os.linesep)

        cluster.start(wait_for_binary_proto=True)
        # execute workload ....
        # cassandra-stress user no-warmup profile=/tmp/complex_schema.yaml  ops'(insert=1)' cl=ALL n=5000000 -mode cql3 native -rate threads=100 -pop seq=1..5000000 -node 127.0.0.1
        # insert:
        #     partitions: fixed(2)
        #     select:    fixed(1)/1000
        #     batchtype: LOGGED

@gleb-cloudius (Contributor) commented May 20, 2019

@amoskong (Contributor) commented May 20, 2019

I tried different kinds of conditions; it also can't be reproduced with 4 nodes, each in its own rack.

My reproducer: https://github.com/amoskong/scylla-dtest/commits/issue_3229_v1

$ cat reproducer_issue3229.py 
from dtest import Tester
import os

class Reproducer(Tester):

    def test_issue3229(self):
        cluster = self.cluster
        cluster.populate([4])
        cluster.set_configuration_options(values={'endpoint_snitch': 'org.apache.cassandra.locator.GossipingPropertyFileSnitch'})

        for i, node in enumerate(cluster.nodelist()):
            with open(os.path.join(node.get_conf_dir(), 'cassandra-rackdc.properties'), 'w') as snitch_file:
                for line in ["dc={}".format(node.data_center), "rack=rack{}".format(i % 4)]:
                    snitch_file.write(line + os.linesep)

        cluster.start(wait_for_binary_proto=True)

        node.stress(["user", "no-warmup", "profile=/tmp/complex_schema.yaml", "ops(insert=1)", "cl=ALL", "n=5000000", "-mode", "cql3", "native", "-rate", "threads=100", "-pop", "seq=1..5000000"])

@gleb-cloudius (Contributor) commented May 20, 2019

To reproduce the bug we depend on how the hash table hashes its elements, and this depends on the implementation and on the elements themselves. In my test I named the racks "rc, rc2...". Those names are the keys into the hash table. Try naming them the same way.
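
To make the dependence on the key names concrete, here is a standalone C++ sketch (illustrative only; it uses the "rc, rc2..." naming scheme mentioned above, and actual bucket placement varies between standard-library implementations) that prints which bucket each rack name lands in:

#include <iostream>
#include <string>
#include <unordered_map>

int main() {
    // Dummy rack -> endpoint multimap; only the keys (rack names) matter here.
    std::unordered_multimap<std::string, int> racks;
    for (const std::string rack : {"rc", "rc2", "rc3", "rc4"}) {
        racks.emplace(rack, 0);
    }
    for (const std::string rack : {"rc", "rc2", "rc3", "rc4"}) {
        std::cout << rack << " -> bucket " << racks.bucket(rack)
                  << " (of " << racks.bucket_count() << " buckets)\n";
    }
    // If two different rack names print the same bucket index, that bucket's
    // bucket_size() mixes endpoints from both racks -- the collision the
    // reproduction recipe tries to provoke. Which names collide depends on
    // std::hash<std::string> and the current bucket count.
    return 0;
}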

@amoskong (Contributor) commented May 20, 2019

@gleb-cloudius (Contributor) commented May 20, 2019
