longevity 5000 tables in one keyspace, streaming failed (Cannot assign requested address) #4943

Closed
fruch opened this issue Sep 2, 2019 · 34 comments

@fruch fruch commented Sep 2, 2019

This is Scylla's bug tracker, to be used for reporting bugs only.
If you have a question about Scylla, and not a bug, please ask it in
our mailing-list at scylladb-dev@googlegroups.com or in our slack channel.

  • I have read the disclaimer above, and I am reporting a suspected malfunction in Scylla.

Installation details

Scylla version (or git commit hash): 3.1.0.rc4-0.20190830.d70c2db09
Cluster size: 2
OS (RHEL/CentOS/Ubuntu/AWS AMI): ami-04b9560705a3975d1

logs

gs://scratch.scylladb.com/ifruchte/5000ks_stream_error_node1.log
gs://scratch.scylladb.com/ifruchte/5000ks_stream_error_node2.log

summary

After creating 5000 tables in one keyspace on node1,
node2 was added to the cluster and started to stream from node1.

Streaming failed for keyspace=feeds.
node1:

2019-09-02T13:43:53+00:00  ip-10-0-136-244 !WARNING | scylla: [shard 0] stream_session - [Stream #aed42ad0-cd87-11e9-8cb8-000000000005] stream_transfer_task: Fail to send to 10.0.68.176:0: std::system_error (error system:99, connect: Cannot assign requested address)
2019-09-02T13:43:53+00:00  ip-10-0-136-244 !WARNING | scylla: [shard 0] stream_session - [Stream #aed42ad0-cd87-11e9-8cb8-000000000005] Failed to send: std::system_error (error system:99, connect: Cannot assign requested address)
2019-09-02T13:43:53+00:00  ip-10-0-136-244 !WARNING | scylla: [shard 0] stream_session - [Stream #aed42ad0-cd87-11e9-8cb8-000000000005] Streaming error occurred, peer=10.0.68.176
2019-09-02T13:43:53+00:00  ip-10-0-136-244 !WARNING | scylla: [shard 0] stream_session - [Stream #aed42ad0-cd87-11e9-8cb8-000000000005] Streaming plan for Bootstrap-feeds-index-0 failed, peers={10.0.68.176}, tx=0 KiB, 0.00 KiB/s, rx=0 KiB, 0.00 KiB/s

node2:

2019-09-02T13:35:02+00:00  ip-10-0-68-176 !INFO    | scylla: [shard 0] range_streamer - Bootstrap with 10.0.136.244 for keyspace=feeds, 51 out of 513 ranges: ranges = 51
2019-09-02T13:35:02+00:00  ip-10-0-68-176 !INFO    | scylla: [shard 0] stream_session - [Stream #773da070-cd86-11e9-8cb8-000000000005] Executing streaming plan for Bootstrap-feeds-index-0 with peers={10.0.136.244}, master
2019-09-02T13:35:10+00:00  ip-10-0-68-176 !INFO    | scylla: [shard 0] stream_session - [Stream #773da070-cd86-11e9-8cb8-000000000005] Received failed complete message, peer=10.0.136.244
2019-09-02T13:35:10+00:00  ip-10-0-68-176 !WARNING | scylla: [shard 0] stream_session - [Stream #773da070-cd86-11e9-8cb8-000000000005] Streaming plan for Bootstrap-feeds-index-0 failed, peers={10.0.136.244}, tx=0 KiB, 0.00 KiB/s, rx=0 KiB, 0.00 KiB/s
2019-09-02T13:35:10+00:00  ip-10-0-68-176 !WARNING | scylla: [shard 0] range_streamer - Bootstrap with 10.0.136.244 for keyspace=feeds failed, took 8.11 seconds: streaming::stream_exception (Stream failed)
2019-09-02T13:35:10+00:00  ip-10-0-68-176 !WARNING | scylla: [shard 0] range_streamer - Bootstrap failed, took 8 seconds, nr_ranges_remaining=513
2019-09-02T13:35:10+00:00  ip-10-0-68-176 !WARNING | scylla: [shard 0] range_streamer - Bootstrap failed to stream. Will retry in 60 seconds ...

node2 keeps retrying, but keeps failing.

@slivne slivne commented Sep 2, 2019

Is this still up?

Can you run netstat -an on both machines?

@fruch fruch commented Sep 3, 2019

@slivne it happened again on a new cluster.

I assumed you only care about the TCP entries.

node1

[centos@ip-10-0-203-102 ~]$ netstat -an                                                                                                                                                                        
Active Internet connections (servers and established)                                                                                                                                                          
Proto Recv-Q Send-Q Local Address           Foreign Address         State                                                                                                                                      
tcp        0      0 0.0.0.0:111             0.0.0.0:*               LISTEN                                                                                                                                     
tcp        0      0 127.0.0.1:10000         0.0.0.0:*               LISTEN                                                                                                                                     
tcp        0      0 0.0.0.0:9042            0.0.0.0:*               LISTEN                                                                                                                                     
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN                                                                                                                                     
tcp        0      0 10.0.203.102:7000       0.0.0.0:*               LISTEN                                                                                                                                     
tcp        0      0 127.0.0.1:25            0.0.0.0:*               LISTEN                                                                                                                                     
tcp        0      0 0.0.0.0:9180            0.0.0.0:*               LISTEN                                                                                                                                     
tcp        0      0 0.0.0.0:9160            0.0.0.0:*               LISTEN                                                                                                                                     
tcp        0      0 10.0.203.102:9180       10.0.79.96:48276        ESTABLISHED                                                                                                                                
tcp        0    208 10.0.203.102:22         199.203.229.89:43580    ESTABLISHED                                                                                                                                
tcp        1      0 10.0.203.102:62608      169.254.169.254:80      CLOSE_WAIT                                                                                                                                 
tcp6       0      0 :::45839                :::*                    LISTEN                                                                                                                                     
tcp6       0      0 :::111                  :::*                    LISTEN                                                                                                                                     
tcp6       0      0 :::22                   :::*                    LISTEN                                                                                                                                     
tcp6       0      0 ::1:25                  :::*                    LISTEN                                                                                                                                     
tcp6       0      0 127.0.0.1:7199          :::*                    LISTEN                                                                                                                                     
tcp6       0      0 :::9100                 :::*                    LISTEN                                                                                                                                     
tcp6       0      0 127.0.0.1:52484         127.0.0.1:10000         TIME_WAIT                                                                                                                                  
tcp6       0      0 10.0.203.102:9100       10.0.79.96:45300        ESTABLISHED                                                                                                                                
udp        0      0 0.0.0.0:68              0.0.0.0:*                                                                                                                                                          
udp        0      0 0.0.0.0:111             0.0.0.0:*                                                                                                                                                          
udp        0      0 127.0.0.1:323           0.0.0.0:*                                                                                                                                                          
udp6       0      0 :::111                  :::*                                                                                                                                                               
udp6       0      0 ::1:323                 :::*                                               

node2

[centos@ip-10-0-176-25 ~]$ netstat -an                                                                                                                                                                         
Active Internet connections (servers and established)                                                                                                                                                          
Proto Recv-Q Send-Q Local Address           Foreign Address         State                                                                                                                                      
tcp        0      0 127.0.0.1:25            0.0.0.0:*               LISTEN                                                                                                                                     
tcp        0      0 0.0.0.0:111             0.0.0.0:*               LISTEN                                                                                                                                     
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN                                                                                                                                     
tcp        0    208 10.0.176.25:22          199.203.229.89:38208    ESTABLISHED                                                                                                                                
tcp6       0      0 ::1:25                  :::*                    LISTEN                                                                                                                                     
tcp6       0      0 127.0.0.1:7199          :::*                    LISTEN                                                                                                                                     
tcp6       0      0 :::9100                 :::*                    LISTEN                                                                                                                                     
tcp6       0      0 :::36975                :::*                    LISTEN                                                                                                                                     
tcp6       0      0 :::111                  :::*                    LISTEN                                                                                                                                     
tcp6       0      0 :::22                   :::*                    LISTEN                                                                                                                                     
tcp6       0      0 10.0.176.25:9100        10.0.79.96:46768        ESTABLISHED                                                                                                                                
udp        0      0 0.0.0.0:68              0.0.0.0:*                                                                                                                                                          
udp        0      0 0.0.0.0:111             0.0.0.0:*                                                                                                                                                          
udp        0      0 127.0.0.1:323           0.0.0.0:*                                                                                                                                                          
udp6       0      0 :::111                  :::*                                                                                                                                                               
udp6       0      0 ::1:323                 :::*                                
@slivne slivne added this to the 3.1 milestone Sep 3, 2019

@gleb-cloudius gleb-cloudius commented Sep 3, 2019

@fruch fruch commented Sep 3, 2019

Following @gleb-cloudius's advice,

I restarted the failing node (node2) and ran the following on node1 while the problem was happening:

[centos@ip-10-0-203-102 ~]$ netstat -an  > netstat.log
[centos@ip-10-0-203-102 ~]$ vi netstat.log
[centos@ip-10-0-203-102 ~]$ grep TIME_WAIT  netstat.log   |  wc -l
15069

There are more than 15k lines like this:

tcp        0      0 10.0.203.102:55892      10.0.176.25:7000        TIME_WAIT

the full netstat log:
http://scratch.scylladb.com/fruch/4943_netstat.log
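
To see where those TIME_WAIT sockets point, the netstat dump can be grouped by remote endpoint. This is a minimal Python sketch, not part of Scylla or of the original report. The helper name `count_time_wait` is hypothetical, and it assumes the six-column `netstat -an` layout shown above.

```python
from collections import Counter


def count_time_wait(netstat_text):
    """Count TIME_WAIT TCP sockets per remote endpoint (hypothetical helper).

    Assumes `netstat -an` columns:
    Proto Recv-Q Send-Q Local-Address Foreign-Address State
    """
    counts = Counter()
    for line in netstat_text.splitlines():
        fields = line.split()
        # UDP rows and header rows have a different shape and are skipped.
        if len(fields) >= 6 and fields[0].startswith("tcp") and fields[5] == "TIME_WAIT":
            counts[fields[4]] += 1
    return counts


sample = """\
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 10.0.203.102:7000       0.0.0.0:*               LISTEN
tcp        0      0 10.0.203.102:55892      10.0.176.25:7000        TIME_WAIT
tcp        0      0 10.0.203.102:55893      10.0.176.25:7000        TIME_WAIT
tcp6       0      0 127.0.0.1:52484         127.0.0.1:10000         TIME_WAIT
udp        0      0 0.0.0.0:68              0.0.0.0:*
"""

for endpoint, n in count_time_wait(sample).most_common():
    print(endpoint, n)
```

If nearly all of the count lands on the peer's port 7000, as the sample line above suggests, the local ports are being consumed by inter-node connections rather than by client or monitoring traffic.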

@gleb-cloudius gleb-cloudius commented Sep 3, 2019

@asias why do we open so many streams? Didn't you fix it?

@slivne slivne commented Sep 4, 2019

@asias ping

@slivne slivne commented Sep 4, 2019

asias's patch for not opening connections on all cores 0e6b622 is merged into 3.1 - so we are not missing this backport

@asias asias commented Sep 5, 2019

From the start of bootstrapping node2 until the first streaming failure on node1 (the sender), there are only 2684 stream plan connections from node1 to node2 across all shards. 1452 of the 2684 are connections for the keyspace feeds, which has 5000 tables.

[asias@hjpc2 issue.4943]$ cat stream.first.txt |grep 'Start send'|wc -l
2684
[asias@hjpc2 issue.4943]$ cat stream.first.txt |grep 'Start send'|grep ks=feeds|wc -l
1452

That alone should not exhaust the ports on node1. Something other than streaming must be using the ports; the Prometheus server, perhaps?
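
The two grep pipelines above can also be done in one pass that breaks the count down by keyspace. This is only a sketch in Python: `count_stream_starts` is a hypothetical name, and it assumes the log lines look like the "Start sending ks=..., cf=..." stream_session messages quoted elsewhere in this issue.

```python
import re
from collections import Counter

# Matches "Start send..." followed by "ks=<keyspace>"; the exact log format
# is an assumption based on the messages quoted in this issue.
START_SEND = re.compile(r"Start send\w*\s+ks=([^,\s]+)")


def count_stream_starts(log_lines):
    """Count 'Start send' log lines per keyspace (hypothetical helper)."""
    counts = Counter()
    for line in log_lines:
        match = START_SEND.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


example = [
    "[shard 21] stream_session - [Stream #x] Start sending ks=feeds, cf=t1, estimated_partitions=0",
    "[shard 3] stream_session - [Stream #x] Start sending ks=feeds, cf=t2, estimated_partitions=0",
    "[shard 0] stream_session - [Stream #x] Start sending ks=keyspace1, cf=standard_1_43",
    "[shard 0] stream_session - [Stream #x] Streaming error occurred, peer=10.0.0.183",
]

counts = count_stream_starts(example)
print(sum(counts.values()), dict(counts))
```

The total corresponds to the first grep and the per-keyspace entry to the second.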

asias added a commit to asias/scylla that referenced this issue Sep 5, 2019
We can use the reader::peek() to check if the reader contains any data.
If not, do not open the rpc stream connection. It helps to reduce the
port usage.

Refs: scylladb#4943

@asias asias commented Sep 5, 2019

I created #4968 to open the connection only if there is data in the reader. This is in addition to the existing check, which opens the connection only if the range is relevant for the shard.

But I think we have other issues that exhaust the ports beyond streaming.

@slivne slivne commented Sep 5, 2019

@slivne try to reproduce outside of sct

@slivne slivne self-assigned this Sep 5, 2019

@slivne slivne commented Sep 5, 2019

easy way to reproduce

create a single node with 500 empty tables using a large machine (i3.16x)

schema.txt

add a new node / nodetool rebuild it and you will get

on 3.1.0.rc5

[shard 0] stream_session - [Stream #ae8b4c10-d026-11e9-b650-000000000009] Streaming plan for Rebuild-keyspace1-index-1 failed, peers={10.0.0.140}, tx=0 KiB, 0.00 KiB/s, rx=0 KiB, 0.00 KiB/s
[shard 0] stream_session - [Stream #ae8b4c10-d026-11e9-b650-000000000009] Streaming error occurred, peer=10.0.0.140
[shard 0] stream_session - [Stream #ae8b4c10-d026-11e9-b650-000000000009] Failed to send: std::system_error (error system:99, connect: Cannot assign requested address)
[shard 0] stream_session - [Stream #ae8b4c10-d026-11e9-b650-000000000009] stream_transfer_task: Fail to send to 10.0.0.140:0: std::system_error (error system:99, connect: Cannot assign requested address) 
[centos@ip-10-0-0-68 ~]$ netstat -an | grep TIME_WAIT | wc -l
32697

on 3.0.10

Sep 05 21:52:51 ip-10-0-0-13.ec2.internal scylla[3277]:  [shard 4] stream_session - [Stream #7fc3c870-d027-11e9-81c0-000000000004] Streaming plan for Rebuild-keyspace1-index-0 failed, peers={10.0.0.160}, tx=0 KiB, 0.00 KiB/s, rx=0 KiB, 0.00 KiB/s
Sep 05 21:52:51 ip-10-0-0-13.ec2.internal scylla[3277]:  [shard 4] stream_session - [Stream #7fc3c870-d027-11e9-81c0-000000000004] Streaming error occurred, peer=10.0.0.160
Sep 05 21:52:51 ip-10-0-0-13.ec2.internal scylla[3277]:  [shard 4] stream_session - [Stream #7fc3c870-d027-11e9-81c0-000000000004] Failed to send: std::system_error (error system:98, Address already in use)
Sep 05 21:52:51 ip-10-0-0-13.ec2.internal scylla[3277]:  [shard 4] stream_session - [Stream #7fc3c870-d027-11e9-81c0-000000000004] stream_transfer_task: Fail to send to 10.0.0.160:0: std::system_error (error system:98, Address already in use) 
[centos@ip-10-0-0-13 ~]$ netstat -an | grep TIME | wc -l
28102

So this is related to the new streaming, which opens a connection for every token range. Once a connection is closed it is kept around in TIME_WAIT state, and as we can see this causes an issue because we run out of ports. We have been discussing reusing streaming connections instead of disposing of them in the rpc layer. I'm not sure we have a real choice but to go ahead and do that.

@gleb-cloudius / @avikivity / @asias - thoughts ?
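
A rough budget makes the port exhaustion concrete: on the side that closes first, each closed connection to the same peer address and port holds one local port until TIME_WAIT expires. The Python sketch below is an illustration under stated assumptions, not a measurement from these hosts. It assumes the stock Linux ephemeral range 32768-60999 and the 60-second TIME_WAIT that Linux uses, and the function names are hypothetical. The actual range on a node can be read from /proc/sys/net/ipv4/ip_local_port_range.

```python
def ephemeral_ports(port_low=32768, port_high=60999):
    """Number of local ports available for outgoing connections.

    The defaults are the stock Linux range and are an assumption here;
    the nodes in this issue may be configured differently.
    """
    return port_high - port_low + 1


def max_sustained_connect_rate(port_low=32768, port_high=60999, time_wait_seconds=60.0):
    """Upper bound on new connections per second to one peer address and port,
    if every closed connection holds its local port for the full TIME_WAIT."""
    return ephemeral_ports(port_low, port_high) / time_wait_seconds


print(ephemeral_ports())                       # 28232
print(round(max_sustained_connect_rate(), 1))  # 470.5
```

Under these assumptions, opening and closing a connection per token range passes that budget quickly once a keyspace has hundreds or thousands of tables, which is consistent with the TIME_WAIT counts shown above.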

@slivne slivne commented Sep 5, 2019

@roydahan / @fruch did this really work for you guys on 3.0.X ?

@fruch fruch commented Sep 5, 2019

We ran it on 3.0.8, and it did pass that stage. We managed to add 5 nodes.

@slivne slivne commented Sep 7, 2019

Well indeed it seems that bootstrap:

  • on 3.0.10 ScyllaDB 3.0.10 (ami-0b88175d56d413224) the bootstrap fails
    Failed to send: std::system_error (error system:98, Address already in use)
  • on 3.0.9 ScyllaDB 3.0.9 (ami-08e617d9f57f477ee) the bootstrap works
  • on 3.0.8 ScyllaDB 3.0.8 (ami-08d3b18bed79ebefc) the bootstrap works

So we have a starting point for the search: the diff between 3.0.9 and 3.0.10.

The most probable culprit, in my view, is cc0b4d2.
@gleb-cloudius I'll need your help with this; Asias is out this week.

how to reproduce

  1. Boot first node - node1 i3.16X
  2. node1 run cqlsh -f schema.txt
    schema.txt
  3. Boot second node - node2 i3.16x
  4. node2 sudo service scylla-server stop
  5. node2 sudo rm -Rf /var/lib/scylla/data/* /var/lib/scylla/commitlog/*
  6. node2 - change seed to node1
  7. node2 sudo service scylla-server start
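
The contents of the attached schema.txt are not shown in this thread. For anyone who wants to try the steps above without it, a stand-in schema with many empty tables could be generated along these lines. This is a Python sketch under assumptions: the table names, columns, and replication settings are hypothetical and are not taken from the attachment.

```python
def make_schema(keyspace="keyspace1", tables=500, replication_factor=1):
    """Build CQL text for one keyspace with many empty tables.

    Everything here (names, columns, SimpleStrategy) is a hypothetical
    stand-in for the schema.txt attachment, which is not shown in the issue.
    """
    statements = [
        f"CREATE KEYSPACE IF NOT EXISTS {keyspace} WITH replication = "
        f"{{'class': 'SimpleStrategy', 'replication_factor': {replication_factor}}};"
    ]
    for i in range(tables):
        statements.append(
            f"CREATE TABLE IF NOT EXISTS {keyspace}.standard_{i} "
            f"(key blob PRIMARY KEY, value blob);"
        )
    return "\n".join(statements) + "\n"


if __name__ == "__main__":
    # Writes the stand-in schema to a local file (the file name is arbitrary).
    with open("schema_standin.cql", "w") as out:
        out.write(make_schema())
```

The resulting file can be loaded with cqlsh -f in place of schema.txt in step 2.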

@gleb-cloudius gleb-cloudius commented Sep 8, 2019

@slivne can you try those patches:

seastar.txt
scylla.txt

@slivne slivne commented Sep 8, 2019

tried branches are on
seastar-dev/shlomi/gleb-4943-3-1
seastar-dev/shlomi/gleb-4943-seastar-3-1

rpms wget http://scratch.scylladb.com/shlomi/scylla-3.1.0-gleb-4943.tar

I installed on top of 3.1.0.rc5 AMI
getting

Sep 08 20:44:29 ip-10-0-0-90.ec2.internal scylla[15244]:  [shard 0] stream_session - [Stream #74128f80-d279-11e9-a46a-00000000001a] Failed to send: std::system_error (error system:99, connect: Cannot assign requested address)
 netstat -an | grep TIME | wc -l
32693

@gleb-cloudius - I am not sure the reuseaddr is actually set

@gleb-cloudius gleb-cloudius commented Sep 9, 2019

@gleb-cloudius gleb-cloudius commented Sep 9, 2019

It looks like the previous patch sets the reuseaddr option on the wrong socket. Can you try this one:

seastar2.txt

@slivne slivne commented Sep 10, 2019

Installed a new version with this

tried branches are on
seastar-dev/shlomi/gleb-4943-3-1
seastar-dev/shlomi/gleb-4943-seastar-3-1_v2

rpms wget http://scratch.scylladb.com/shlomi/scylla-3.1.0-gleb-4943_v2.tar

The same with this version

Sep 10 07:40:49 ip-10-0-0-90.ec2.internal scylla[64351]:  [shard 0] stream_session - [Stream #4f36b670-d39e-11e9-b3d2-00000000000f] Failed to send: std::system_error (error system:99, connect: Cannot assign requested address)

@slivne slivne commented Sep 10, 2019

yet it seems that SO_REUSEADDR was set ... the number of ports in TIME_WAIT is much lower

[centos@ip-10-0-0-90 ~]$ netstat -an | grep TIME | wc -l
61

so something else is broken here ... @gleb-cloudius

@gleb-cloudius gleb-cloudius commented Sep 10, 2019

Can you try with this patch on top of previous one?

eaddrnoavail.txt

@gleb-cloudius gleb-cloudius commented Sep 10, 2019

One more on top:

recreate-socket.txt

@slivne slivne commented Sep 10, 2019

with the last I got

Sep 10 14:27:41 ip-10-0-0-183.ec2.internal scylla[5638]:  [shard 31] rpc - client 10.0.0.90:56575: server connection dropped: connection is closed
Sep 10 14:27:41 ip-10-0-0-183.ec2.internal scylla[5638]:  [shard 0] stream_session - [Stream #255270e0-d3d7-11e9-b8cb-00000000000f] Received failed complete message, peer=10.0.0.90
Sep 10 14:27:41 ip-10-0-0-183.ec2.internal scylla[5638]:  [shard 0] stream_session - [Stream #255270e0-d3d7-11e9-b8cb-00000000000f] Streaming plan for Bootstrap-keyspace1-index-0 failed, peers={10.0.0.90}, tx=0 KiB, 0.00 KiB/s, rx=0 KiB, 0.00 KiB/s
Sep 10 14:27:41 ip-10-0-0-183.ec2.internal scylla[5638]:  [shard 0] range_streamer - Bootstrap with 10.0.0.90 for keyspace=keyspace1 failed, took 0.977 seconds: streaming::stream_exception (Stream failed)
Sep 10 14:27:41 ip-10-0-0-183.ec2.internal scylla[5638]:  [shard 0] range_streamer - Bootstrap failed, took 0 seconds, nr_ranges_remaining=513
Sep 10 14:27:41 ip-10-0-0-183.ec2.internal scylla[5638]:  [shard 0] range_streamer - Bootstrap failed to stream. Will retry in 202 seconds ...

and

Sep 10 14:27:41 ip-10-0-0-90.ec2.internal scylla[710]:  [shard 21] stream_session - [Stream #255270e0-d3d7-11e9-b8cb-00000000000f] Start sending ks=keyspace1, cf=standard_1_43, estimated_partitions=0, with new rpc streaming
Sep 10 14:27:41 ip-10-0-0-90.ec2.internal scylla[710]:  [shard 0] stream_session - [Stream #255270e0-d3d7-11e9-b8cb-00000000000f] stream_transfer_task: Fail to send to 10.0.0.183:0: std::system_error (error system:107, read: Transport endpoint is not connected)
Sep 10 14:27:41 ip-10-0-0-90.ec2.internal scylla[710]:  [shard 0] stream_session - [Stream #255270e0-d3d7-11e9-b8cb-00000000000f] Failed to send: std::system_error (error system:107, read: Transport endpoint is not connected)
Sep 10 14:27:41 ip-10-0-0-90.ec2.internal scylla[710]:  [shard 0] stream_session - [Stream #255270e0-d3d7-11e9-b8cb-00000000000f] Streaming error occurred, peer=10.0.0.183
Sep 10 14:27:41 ip-10-0-0-90.ec2.internal scylla[710]:  [shard 0] stream_session - [Stream #255270e0-d3d7-11e9-b8cb-00000000000f] Streaming plan for Bootstrap-keyspace1-index-0 failed, peers={10.0.0.183}, tx=0 KiB, 0.00 KiB/s, rx=0 KiB, 0.00 KiB/s
^C
@bhalevy bhalevy added the Regression label Sep 11, 2019
@slivne slivne added bug high labels Sep 11, 2019

@slivne slivne commented Sep 12, 2019

The latest versions are on

tried branches are on
seastar-dev/shlomi/gleb-4943-3-1
seastar-dev/shlomi/gleb-4943-seastar-3-1_v5

The errors I get with these versions are:

seed node

Sep 12 07:22:43 ip-10-0-0-100.ec2.internal scylla[78515]:  [shard 1] stream_session - [Stream #1c38b0c0-d52e-11e9-816f-00000000000e] Start sending ks=system_traces, cf=node_slow_log_time_idx, estimated_partition
Sep 12 07:22:43 ip-10-0-0-100.ec2.internal scylla[78515]:  [shard 38] rpc - client 10.0.0.119:7000: client stream connection dropped: read: Transport endpoint is not connected
Sep 12 07:22:43 ip-10-0-0-100.ec2.internal scylla[78515]:  [shard 0] stream_session - [Stream #1c38b0c0-d52e-11e9-816f-00000000000e] stream_transfer_task: Fail to send to 10.0.0.119:0: std::system_error (error s
Sep 12 07:22:43 ip-10-0-0-100.ec2.internal scylla[78515]:  [shard 0] stream_session - [Stream #1c38b0c0-d52e-11e9-816f-00000000000e] Failed to send: std::system_error (error system:107, read: Transport endpoint 
Sep 12 07:22:43 ip-10-0-0-100.ec2.internal scylla[78515]:  [shard 0] stream_session - [Stream #1c38b0c0-d52e-11e9-816f-00000000000e] Streaming error occurred, peer=10.0.0.119
Sep 12 07:22:43 ip-10-0-0-100.ec2.internal scylla[78515]:  [shard 0] stream_session - [Stream #1c38b0c0-d52e-11e9-816f-00000000000e] Streaming plan for Bootstrap-system_traces-index-6 failed, peers={10.0.0.119},
Sep 12 07:27:47 ip-10-0-0-100.ec2.internal scylla[78515]:  [shard 0] rpc - client 10.0.0.119:7000: fail to connect: Connection refused

bootstrapping node

Sep 12 07:22:43 ip-10-0-0-119.ec2.internal scylla[71491]:  [shard 0] stream_session - [Stream #1c38b0c0-d52e-11e9-816f-00000000000e] Executing streaming plan for Bootstrap-system_traces-index-6 with peers={10.0.
Sep 12 07:22:43 ip-10-0-0-119.ec2.internal scylla[71491]:  [shard 38] rpc - client 10.0.0.100:57450: server connection dropped: connection is closed
Sep 12 07:22:43 ip-10-0-0-119.ec2.internal scylla[71491]:  [shard 0] stream_session - [Stream #1c38b0c0-d52e-11e9-816f-00000000000e] Received failed complete message, peer=10.0.0.100
Sep 12 07:22:43 ip-10-0-0-119.ec2.internal scylla[71491]:  [shard 0] stream_session - [Stream #1c38b0c0-d52e-11e9-816f-00000000000e] Streaming plan for Bootstrap-system_traces-index-6 failed, peers={10.0.0.100},
Sep 12 07:22:43 ip-10-0-0-119.ec2.internal scylla[71491]:  [shard 0] range_streamer - Bootstrap with 10.0.0.100 for keyspace=system_traces failed, took 0.596 seconds: streaming::stream_exception (Stream failed)
Sep 12 07:22:43 ip-10-0-0-119.ec2.internal scylla[71491]:  [shard 0] range_streamer - Bootstrap failed, took 0 seconds, nr_ranges_remaining=1041
Sep 12 07:22:43 ip-10-0-0-119.ec2.internal scylla[71491]:  [shard 0] range_streamer - Bootstrap failed to stream. Will retry in 303 seconds ...
Sep 12 07:27:46 ip-10-0-0-119.ec2.internal scylla[71491]:  [shard 0] boot_strapper - Error during bootstrap: streaming::stream_exception (Stream failed)

Attached are the tcp dumps:

bootstrap.cap.zip
seed.cap.zip

@slivne slivne removed their assignment Sep 12, 2019
@fruch fruch mentioned this issue Sep 12, 2019
6 of 6 tasks complete

@gleb-cloudius gleb-cloudius commented Sep 12, 2019

Here is one more patch to try on top of them all:

onemore.txt

@slivne slivne commented Sep 14, 2019

@gleb-cloudius please update the issue with the latest info

@slivne slivne added the Backport 3.0 label Sep 15, 2019
avikivity added a commit to scylladb/seastar that referenced this issue Sep 16, 2019
…ent socket" from Gleb

"
We want to be able to reuse local ports faster in rpc, so add reuseaddr option
and fix some bugs on the way.

Ref scylladb/scylla#4943
"

* 'gleb/reuseaddr-v2' of github.com:cloudius-systems/seastar-dev:
  rpc: add new rpc option to enable local address reuse by rpc clients
  net: add an ability to set reuseaddr option on a client socket
  net: retry connection attempt on EADDRNOTAVAIL
  net: do not retry connection attempt if a requested port cannot be bound
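
For readers unfamiliar with the two client-side behaviors named in the commit titles above (setting reuseaddr on a client socket and retrying a connection attempt on EADDRNOTAVAIL), here is a minimal sketch using plain Python sockets. It is not Seastar's code and does not claim to match how the patches are implemented. The helper name `connect_with_retry`, the retry count, and the delay are all hypothetical.

```python
import errno
import socket
import time


def connect_with_retry(host, port, retries=3, delay=0.05):
    """Connect a TCP client socket with SO_REUSEADDR set, retrying only when
    the connect fails with EADDRNOTAVAIL (hypothetical helper, not Seastar)."""
    last_error = None
    for _ in range(retries + 1):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            # Set the option on the client socket before connecting.
            sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
            sock.connect((host, port))
            return sock
        except OSError as exc:
            sock.close()
            if exc.errno != errno.EADDRNOTAVAIL:
                # Any other failure (e.g. connection refused) is not retried.
                raise
            last_error = exc
            time.sleep(delay)
    raise last_error
```

A caller would use it as `sock = connect_with_retry("10.0.0.1", 7000)` (the address is a placeholder) and close the socket when done. Each retry creates a fresh socket rather than reusing the failed one.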

@bhalevy bhalevy commented Sep 19, 2019

@roydahan : @fruch is testing the fix.
@avikivity will backport 73e3d0a once merged to master.

@fruch fruch commented Sep 19, 2019

@bhalevy I will test once the whole fix is merged. So far only the Seastar portion of it has been merged; I'm waiting for the Scylla part to get into master.

@bhalevy bhalevy commented Sep 22, 2019

@fruch 73e3d0a was merged to master, do you need it backported for starting to test or can you test it provisionally?

@fruch fruch commented Sep 22, 2019

I can try the master AMI. Whether this needs to be backported to 3.1 is your call.

@bhalevy bhalevy commented Sep 25, 2019

@avikivity please backport 73e3d0a to 3.1

@fruch fruch commented Oct 2, 2019

I've rerun this test case with the master AMI. It failed with a coredump from a different issue, but this one seems to be fixed.

the new issue raised is #5131

@avikivity avikivity commented Oct 2, 2019

I'd like to understand the regression status of this bug. Is it a regression in 3.0? If so which commit regressed it? Is it a regression in 3.1 relative to 3.0?

@slivne slivne commented Oct 2, 2019

@gleb-cloudius / @asias can you help

AFAIK this is a regression in 3.0; its source is the change cc0b4d2, which was released as part of 3.0.9.

the fix from gleb was
" > Merge "fix some tcp connection bugs and add reuseaddr option to a client socket" from Gleb"

avikivity added a commit to scylladb/scylla-seastar that referenced this issue Oct 3, 2019
…ent socket" from Gleb

"
We want to be able to reuse local ports faster in rpc, so add reuseaddr option
and fix some bugs on the way.

Ref scylladb/scylla#4943
"

* 'gleb/reuseaddr-v2' of github.com:cloudius-systems/seastar-dev:
  rpc: add new rpc option to enable local address reuse by rpc clients
  net: add an ability to set reuseaddr option on a client socket
  net: retry connection attempt on EADDRNOTAVAIL
  net: do not retry connection attempt if a requested port cannot be bound

(cherry picked from commit 5132153)
avikivity added a commit that referenced this issue Oct 3, 2019
* seastar 7dfcf334c4...75488f6ef2 (2):
  > net: socket::{set,get}_reuseaddr() should not be virtual
  > Merge "fix some tcp connection bugs and add reuseaddr option to a client socket" from Gleb

Prerequisite for #4943.
avikivity added a commit that referenced this issue Oct 3, 2019
Fixes #4943

Message-Id: <20190918152405.GV21540@scylladb.com>
(cherry picked from commit 73e3d0a)
avikivity added a commit to scylladb/scylla-seastar that referenced this issue Oct 3, 2019
…ent socket" from Gleb

"
We want to be able to reuse local ports faster in rpc, so add reuseaddr option
and fix some bugs on the way.

Ref scylladb/scylla#4943
"

* 'gleb/reuseaddr-v2' of github.com:cloudius-systems/seastar-dev:
  rpc: add new rpc option to enable local address reuse by rpc clients
  net: add an ability to set reuseaddr option on a client socket
  net: retry connection attempt on EADDRNOTAVAIL
  net: do not retry connection attempt if a requested port cannot be bound

(cherry picked from commit 5132153)
avikivity added a commit that referenced this issue Oct 3, 2019
* seastar af3fc691b9...3920dcb3f8 (2):
  > net: socket::{set,get}_reuseaddr() should not be virtual
  > Merge "fix some tcp connection bugs and add reuseaddr option to a client socket" from Gleb

Prerequisite for #4943.
avikivity added a commit that referenced this issue Oct 3, 2019
Fixes #4943

Message-Id: <20190918152405.GV21540@scylladb.com>
(cherry picked from commit 73e3d0a)
tgrabiec added a commit that referenced this issue Oct 8, 2019
We can use the reader::peek() to check if the reader contains any data.
If not, do not open the rpc stream connection. It helps to reduce the
port usage.

Refs: #4943