
DPDK mode bugs cassandra-stress #327

Closed
glommer opened this issue Sep 9, 2015 · 16 comments
Comments

@glommer
Contributor

glommer commented Sep 9, 2015

When running a high load against a table with 16 columns, cassandra-stress bails out on us with the following messages:

com.datastax.driver.core.exceptions.DriverInternalError: An unexpected protocol error occurred. This is a bug in this library, please report: truncated frame 
com.datastax.driver.core.exceptions.InvalidQueryException: Invalid null value for partition key part name

Despite what the client says, this does not happen with our posix mode, nor with Origin.

Avi suspects we may be closing the connection too soon.
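
For readers unfamiliar with the client-side error: in the CQL native protocol (v3/v4) every frame starts with a 9-byte header whose last four bytes declare the body length, so a server that drops the connection mid-write leaves the driver holding fewer bytes than the header promised, which the driver then reports as a truncated frame. The snippet below is not Scylla or driver code, just a minimal, self-contained C++ sketch of that completeness check:

// Sketch only: how a client-side decoder decides a frame is incomplete.
// Header layout (CQL native protocol v3/v4): version, flags, stream (2 bytes),
// opcode, then a 4-byte big-endian body length.
#include <cstdint>
#include <cstdio>
#include <vector>

// Returns true if `buf` holds at least one complete frame; false means more
// bytes are still expected, so a connection close here looks like truncation.
static bool frame_complete(const std::vector<uint8_t>& buf) {
    constexpr size_t header_len = 9;
    if (buf.size() < header_len) {
        return false;                       // even the header is incomplete
    }
    uint32_t body_len = (uint32_t(buf[5]) << 24) | (uint32_t(buf[6]) << 16) |
                        (uint32_t(buf[7]) << 8)  |  uint32_t(buf[8]);
    return buf.size() >= header_len + body_len;
}

int main() {
    // Header claims a 16-byte body, but only 4 body bytes arrived before the
    // server closed the connection.
    std::vector<uint8_t> partial = {0x83, 0x00, 0x00, 0x01, 0x08,
                                    0x00, 0x00, 0x00, 0x10,
                                    0xde, 0xad, 0xbe, 0xef};
    std::printf("frame complete: %s\n", frame_complete(partial) ? "yes" : "no");
}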

Test setup is intel1 / intel2

DML file is:

#
# Keyspace info
#
keyspace: perf_nocl

#
# The CQL for creating a keyspace (optional if it already exists)
#
keyspace_definition: |
    CREATE KEYSPACE perf_nocl WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

#
# Table info
#
table: perf_16

#
# The CQL for creating a table you wish to stress (optional if it already exists)
#
table_definition: |
  CREATE TABLE perf_16 (
        name text PRIMARY KEY,
        col1 blob,
        col2 blob,
        col3 blob,
        col4 blob,
        col5 blob,
        col6 blob,
        col7 blob,
        col8 blob,
        col9 blob,
        col10 blob,
        col11 blob,
        col12 blob,
        col13 blob,
        col14 blob,
        col15 blob,
        col16 blob
  )

#
# Optional meta information on the generated columns in the above table
# The min and max only apply to text and blob types
# The distribution field represents the total unique population
# distribution of that column across rows.  Supported types are
# 
#      EXP(min..max)                        An exponential distribution over the range [min..max]
#      EXTREME(min..max,shape)              An extreme value (Weibull) distribution over the range [min..max]
#      GAUSSIAN(min..max,stdvrng)           A gaussian/normal distribution, where mean=(min+max)/2, and stdev is (mean-min)/stdvrng
#      GAUSSIAN(min..max,mean,stdev)        A gaussian/normal distribution, with explicitly defined mean and stdev
#      UNIFORM(min..max)                    A uniform distribution over the range [min, max]
#      FIXED(val)                           A fixed distribution, always returning the same value
#      Aliases: extr, gauss, normal, norm, weibull
#
#      If preceded by ~, the distribution is inverted
#
# Defaults for all columns are size: uniform(4..8), population: uniform(1..100B), cluster: fixed(1)
#
columnspec:
  - name: name
    size: uniform(1..10)
    population: uniform(1..1M)     # the range of unique values to select for the field (default is 100Billion)

  - name: col1
    size: uniform(20..40)
  - name: col2
    size: uniform(20..40)
  - name: col3
    size: uniform(20..40)
  - name: col4
    size: uniform(20..40)
  - name: col5
    size: uniform(20..40)
  - name: col6
    size: uniform(20..40)
  - name: col7
    size: uniform(20..40)
  - name: col8
    size: uniform(20..40)
  - name: col9
    size: uniform(20..40)
  - name: col10
    size: uniform(20..40)
  - name: col11
    size: uniform(20..40)
  - name: col12
    size: uniform(20..40)
  - name: col13
    size: uniform(20..40)
  - name: col14
    size: uniform(20..40)
  - name: col15
    size: uniform(20..40)
  - name: col16
    size: uniform(20..40)

insert:
  partitions: fixed(1)
  batchtype: UNLOGGED
  select: fixed(1)/1 

#
# A list of queries you wish to run against the schema
#
queries:
   simple1:
      cql: select * from perf_16 where name = ? LIMIT 100
      fields: samerow             # samerow or multirow (select arguments from the same row, or randomly from all rows in the partition)

HEAD is 8405aa1
seastar branch has seastar-dev/issue-187-v1-work applied.

cassandra-stress is invoked as:

for i in $(seq $PROCESSES); do
    ~/cassandra/tools/bin/cassandra-stress user profile=$FILE ops\(insert=1\) duration=5m -rate threads=$THREADS -node $NODE -mode cql3 native -log file=$LDIR/$i &
done

where:

PROCESSES=16
THREADS=120

LDIR is whatever your heart desires

scylla command line is:

sudo build/release/scylla --network-stack native --dpdk-pmd --dhcp 0  --collectd 1 --collectd-address 192.168.0.199:25826 --smp 24  --datadir /data --commitlog-directory  /data -m 24G --options-file conf/scylla.yaml --host-ipv4-addr 192.169.10.118  --gw-ipv4-addr 192.169.10.199 --netmask-ipv4-addr 255.255.255.0 --rpc-address 192.169.10.118 --seed-provider-parameters seeds=127.0.0.1
@glommer
Contributor Author

glommer commented Sep 9, 2015

Hi Vlad.

You asked me to verify if this issue still happens with seastar-dev/issue-187-v2.

It does.

@vladzcloudius
Contributor

On Sep 9, 2015 9:20 PM, "Glauber Costa" notifications@github.com wrote:

> Hi Vlad.
>
> You asked me to verify if this issue still happens with
> seastar-dev/issue-187-v2.
>
> It does.

Good. Thanks.

@slivne slivne added this to the anglerfish milestone Sep 10, 2015
@slivne
Contributor

slivne commented Sep 13, 2015

Check if this is 40G specific.

If this is 40G specific - invest up to 0.5 day to find the issue in the code - and validate it is only for 40G.

If it's only for 40G and the fix does not take 2 hours - defer it; we will move it to the next release.

@dorlaor
Contributor

dorlaor commented Sep 13, 2015

On Sun, Sep 13, 2015 at 9:58 AM, slivne notifications@github.com wrote:

> Check if this is 40G specific.

IIRC @vladzcloudius said it's a generic issue which may not even be DPDK
specific.

> If this is 40G specific - invest up to 0.5 day to find the issue in the
> code - and validate it is only for 40G.
>
> If it's only for 40G and the fix does not take 2 hours - defer it; we will
> move it to the next release.

@glommer
Contributor Author

glommer commented Sep 17, 2015

Any idea what the issue could be?

I just hit this with POSIX. It is the first time I ever hit this with POSIX, and I was very surprised to see this happening.

In the client, I can see the same message:

com.datastax.driver.core.exceptions.DriverInternalError: An unexpected protocol error occurred. This is a bug in this library, please report: Invalid message version. Got � but previous messages on this connection had version �

In the server, I see an assertion being hit explicitly. Here is the backtrace:

#0  0x00007ffff40dd8d7 in raise () from /lib64/libc.so.6
#1  0x00007ffff40df53a in abort () from /lib64/libc.so.6
#2  0x00007ffff40d647d in __assert_fail_base () from /lib64/libc.so.6
#3  0x00007ffff40d6532 in __assert_fail () from /lib64/libc.so.6
#4  0x000000000087259f in schedule<future<T>::then(Func&&) [with Func = transport::cql_server::connection::do_flush()::<lambda()>; Result = future<>; T = {}]::<lambda(auto:1&&)> > (
    func=<unknown type in /home/glommer/urchin/build/release/scylla, CU 0x34ceade, DIE 0x372bdad>, this=0x60c000153c78) at /home/glommer/urchin/seastar/core/future.hh:643
#5  then<transport::cql_server::connection::do_flush()::<lambda()>, future<> > (func=<optimized out>, this=0x60c000153c78) at /home/glommer/urchin/seastar/core/future.hh:785
#6  transport::cql_server::connection::do_flush (this=this@entry=0x60c000153bd0) at transport/server.cc:819
#7  0x000000000087262b in transport::cql_server::poll_pending_responders (this=0x60c00018b010) at transport/server.cc:197
#8  0x00000000008726ba in operator() (__closure=0x7fffe81e6530) at transport/server.cc:419
#9  do_void_futurize_apply<transport::cql_server::connection::process()::<lambda()>::<lambda()>::<lambda()> > (
    func=<unknown type in /home/glommer/urchin/build/release/scylla, CU 0x34ceade, DIE 0x36cdf6d>) at /home/glommer/urchin/seastar/core/future.hh:1078
#10 apply<transport::cql_server::connection::process()::<lambda()>::<lambda()>::<lambda()> > (func=<unknown type in /home/glommer/urchin/build/release/scylla, CU 0x34ceade, DIE 0x36cdf9f>)
    at /home/glommer/urchin/seastar/core/future.hh:1126
#11 operator() (result=..., __closure=0x7fffe81e6530) at /home/glommer/urchin/seastar/core/future.hh:904
#12 futurize<future<> >::apply<future<T>::finally(Func&&) [with Func = transport::cql_server::connection::process()::<lambda()>::<lambda()>::<lambda()>; T = {}]::<lambda(future<>)>, future<> >(<unknown type in /home/glommer/urchin/build/release/scylla, CU 0x34ceade, DIE 0x372d2db>) (
    func=func@entry=<unknown type in /home/glommer/urchin/build/release/scylla, CU 0x34ceade, DIE 0x372d2db>) at /home/glommer/urchin/seastar/core/future.hh:1143
#13 0x0000000000872e42 in future<>::then_wrapped<future<T>::finally(Func&&) [with Func = transport::cql_server::connection::process()::<lambda()>::<lambda()>::<lambda()>; T = {}]::<lambda(future<>)>, future<> >(<unknown type in /home/glommer/urchin/build/release/scylla, CU 0x34ceade, DIE 0x372e8e5>) (this=0x60c000153c78, 
    func=func@entry=<unknown type in /home/glommer/urchin/build/release/scylla, CU 0x34ceade, DIE 0x372e8e5>) at /home/glommer/urchin/seastar/core/future.hh:822
#14 0x000000000087531a in finally<transport::cql_server::connection::process()::<lambda()>::<lambda()>::<lambda()> > (func=<optimized out>, this=<optimized out>)
    at /home/glommer/urchin/seastar/core/future.hh:905
#15 operator() (__closure=0x60c0005bf7b0) at transport/server.cc:421
#16 apply (args=<optimized out>, func=<unknown type in /home/glommer/urchin/build/release/scylla, CU 0x34ceade, DIE 0x36ce17a>) at /home/glommer/urchin/seastar/core/apply.hh:34
#17 apply<transport::cql_server::connection::process()::<lambda()>::<lambda()> > (args=<optimized out>, 
    func=<unknown type in /home/glommer/urchin/build/release/scylla, CU 0x34ceade, DIE 0x36ce1e8>) at /home/glommer/urchin/seastar/core/apply.hh:42
#18 apply<transport::cql_server::connection::process()::<lambda()>::<lambda()> > (args=<optimized out>, 
    func=<unknown type in /home/glommer/urchin/build/release/scylla, CU 0x34ceade, DIE 0x36ce228>) at /home/glommer/urchin/seastar/core/future.hh:1133
#19 operator()<future_state<> > (state=<unknown type in /home/glommer/urchin/build/release/scylla, CU 0x34ceade, DIE 0x373c7b8>, __closure=0x60c0005bf790)
    at /home/glommer/urchin/seastar/core/future.hh:789
#20 continuation<future<T>::then(Func&&) [with Func = transport::cql_server::connection::process()::<lambda()>::<lambda()>; Result = future<>; T = {}]::<lambda(auto:1&&)> >::run(void) (
    this=0x60c0005bf780) at /home/glommer/urchin/seastar/core/future.hh:362
#21 0x000000000047763f in reactor::run_tasks (this=this@entry=0x60c000119000, tasks=...) at core/reactor.cc:1113
#22 0x00000000004a295b in reactor::run (this=0x60c000119000) at core/reactor.cc:1230
#23 0x00000000004a70a6 in smp::<lambda()>::operator()(void) const (__closure=0x60000031a700) at core/reactor.cc:1887
#24 0x0000000000478749 in dpdk_thread_adaptor (f=<optimized out>) at core/reactor.cc:1752
#25 0x000000000061f7b3 in eal_thread_loop ()
#26 0x00007ffff446d52a in start_thread () from /lib64/libpthread.so.0
#27 0x00007ffff41a922d in clone () from /lib64/libc.so.6

Note that frame #24 in the backtrace above is a call to dpdk_thread_adaptor.

I don't know what this is, since I am running --network-stack posix.

Technically, it may not even be the same bug: it may be that every time we forcibly close the connection in the server, we will see that in the client. And we may do that for different reasons.

But here it is...

@vladzcloudius
Contributor

On Sep 13, 2015 9:58 AM, "slivne" notifications@github.com wrote:

> Check if this is 40G specific.
>
> If this is 40G specific - invest up to 0.5 day to find the issue in the
> code - and validate it is only for 40G.
>
> If it's only for 40G and the fix does not take 2 hours - defer it; we will
> move it to the next release.

That's confusing. Could you clarify, please?

@vladzcloudius
Contributor

On Sep 13, 2015 10:04 AM, "Dor Laor" notifications@github.com wrote:

> On Sun, Sep 13, 2015 at 9:58 AM, slivne notifications@github.com wrote:
>
>> Check if this is 40G specific.
>
> IIRC @vladzcloudius said it's a generic issue which may not even be DPDK
> specific.

I only said it's absolutely unclear what the issue is.
I also said that I was unable to reproduce the issue with 10G.

So it may still be a 40G-only issue.

@vladzcloudius
Contributor

On Sep 17, 2015 11:29 PM, "Glauber Costa" notifications@github.com wrote:

> [...]
>
> Note that frame #24 in the backtrace above is a call to
> dpdk_thread_adaptor.
>
> I don't know what this is, since I am running --network-stack posix.

As Nadav has mentioned, it's by design and it's harmless. Gleb added this and
he may remember better why; AFAIR it was related to CPU indexing inside the
DPDK SDK.

@vladzcloudius
Contributor

On Sep 17, 2015 11:29 PM, "Glauber Costa" notifications@github.com wrote:

> [...]
>
> Technically, it may not even be the same bug: it may be that every time
> we forcibly close the connection in the server, we will see that in the
> client. And we may do that for different reasons.

Isn't do_flush() a function recently patched by Avi in the "batching" series?
If I'm not hallucinating, then this may even be a regression...

@glommer
Contributor Author

glommer commented Sep 17, 2015

@avikivity can comment, but while do_flush was introduced by him, the real problem here is that do_flush finds _ready_to_respond empty. His patch does not seem to touch assignments to _ready_to_respond, so I would say he is innocent on this one; his change may just have made the problem more likely to show up on POSIX.
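
To make the failure mode concrete: the assertion in the backtrace fires inside future<>::then()/schedule(), i.e. when a second continuation is chained onto a future that already has one, which is exactly what the later fix commit describes for _ready_to_respond. The following is a deliberately simplified toy model, plain C++ rather than seastar, with illustrative names only (build without -DNDEBUG so the assert is active):

#include <cassert>
#include <cstdio>
#include <functional>

// Toy stand-in for a future that admits exactly one continuation.
struct one_shot_future {
    std::function<void()> continuation;
    void then(std::function<void()> f) {
        // Attaching a second continuation is a bug; seastar asserts similarly.
        assert(!continuation && "continuation already attached");
        continuation = std::move(f);
    }
};

int main() {
    one_shot_future ready_to_respond;   // stands in for _ready_to_respond

    // Connection teardown chains a continuation to wait for outstanding sends...
    ready_to_respond.then([] { std::puts("teardown: all responses flushed"); });

    // ...but the connection is still in _pending_responders, so the poller
    // calls do_flush() and chains onto the very same future: the assert fires.
    ready_to_respond.then([] { std::puts("do_flush: flush next batch"); });
}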

@glommer
Contributor Author

glommer commented Sep 18, 2015

I have reverted Avi's patches, and while the assertion stopped happening, the client side errors lingered.

I don't think we want to revert @avikivity's patches for the release. So we should really find the issue here and fix it.

@vladzcloudius
Contributor

On Sep 18, 2015 5:43 PM, "Glauber Costa" notifications@github.com wrote:

> I have reverted Avi's patches, and while the assertion stopped happening,
> the client-side errors lingered.
>
> I don't think we want to revert @avikivity's patches for the release. So
> we should really find the issue here and fix it.

One piece of "good" news here is that it's apparently not related to native
stack mode, DPDK or 40G cards. I was suspecting there is possibly some memory
trashing in scylla that ends up in garbage data being sent...

But cutting a long story short, I agree with Glauber: it seems like a general
issue, and quite a nasty one, that we would rather fix ASAP.

@slivne
Contributor

slivne commented Sep 18, 2015

I am hitting the do_flush issue on an AWS AMI using head.

using ami:

ami-1fa3d37a

using head

commit 5f32c00
Author: Avi Kivity avi@cloudius-systems.com
Date: Thu Sep 17 15:07:10 2015 +0300

    transport: fix removal of response batch poller when the connection terminates

We remove the poller too early; after _ready_to_respond becomes ready,
it is likely to have been inserted again.

and the three patches I sent on seastar-dev:shlomi/fix_dist_v3

           PID: 885 (scylla)
           UID: 992 (scylla)
           GID: 1001 (scylla)
        Signal: 6 (ABRT)
     Timestamp: Fri 2015-09-18 22:10:19 UTC (1h 8min ago)
  Command Line: /usr/bin/scylla --log-to-syslog 1 --log-to-stdout 0 --default-log-level info --options-file /var/lib/scylla/conf/scylla.yaml --network-stack posix
    Executable: /usr/bin/scylla
 Control Group: /user.slice/user-992.slice/session-c1.scope
          Unit: session-c1.scope
         Slice: user-992.slice
       Session: c1
     Owner UID: 992 (scylla)
       Boot ID: 136fad499440450683d44b8b7f7b8629
    Machine ID: 1cb6c3aa84d24db1b0bbb0bcaefc0ebd
      Hostname: ip-172-30-0-6
       Message: Process 885 (scylla) of user 992 dumped core.

                Stack trace of thread 897:
                #0  0x00007ff2ff9209c8 raise (libc.so.6)
                #1  0x00007ff2ff92265a abort (libc.so.6)
                #2  0x00007ff2ff919187 __assert_fail_base (libc.so.6)
                #3  0x00007ff2ff919232 __assert_fail (libc.so.6)
                #4  0x00000000008ab604 schedule<future<T>::then(Func&&) [with Func = transport::cql_server::connection::do_flush()::<lambda()>; Result = future<>; T = {}]::<lambda(auto:1&&)> > (scylla)
                #5  0x00000000008ab6db _ZN9transport10cql_server23poll_pending_respondersEv (scylla)
                #6  0x00000000008ad55a operator() (scylla)
                #7  0x00000000008ade42 then_wrapped<future<T>::finally(Func&&) [with Func = transport::cql_server::connection::process()::<lambda()>::<lambda()>::<lambda()>; T = {}]::<lambda(future<>)>, future<>
                #8  0x00000000008aebc1 finally<transport::cql_server::connection::process()::<lambda()>::<lambda()>::<lambda()> > (scylla)
                #9  0x00000000008b1aa6 operator()<future_state<> > (scylla)
                #10 0x000000000046f83d _ZN7reactor9run_tasksER15circular_bufferISt10unique_ptrI4taskSt14default_deleteIS2_EESaIS5_EE (scylla)
                #11 0x00000000004951fc _ZN7reactor3runEv (scylla)
                #12 0x00000000004a1091 _ZZN3smp9configureEN5boost15program_options13variables_mapEENKUlvE_clEv.constprop.2514 (scylla)
                #13 0x000000000046c07e _ZNKSt8functionIFvvEEclEv (scylla)
                #14 0x0000000000634c5b eal_thread_loop (scylla)
                #15 0x00007ff2ffcb3555 start_thread (libpthread.so.0)
                #16 0x00007ff2ff9eeb9d __clone (libc.so.6)

machine on AWS is 54.173.222.60

@avikivity as this has debuginfo installed as well - maybe this will simplify finding the issue

I am leaving this machine up for now

@slivne
Contributor

slivne commented Oct 1, 2015

@gleb-cloudius, @pdziepak - I think the patch

commit a15c062
Author: Gleb Natapov gleb@cloudius-systems.com
Date: Mon Sep 21 18:02:43 2015 +0300

transport: fix poller removal

During cql connection removal we wait for all outstanding sends to
complete by waiting for the _ready_to_respond future to resolve, but if
at this point the connection is in _pending_responders then the poller may
call do_flush() and try to reuse the same _ready_to_respond future that
already has a continuation attached to it. The fix is to remove the
connection from the poller before waiting for _ready_to_respond. Special
measures should be taken to prevent the connection from being added to the
poller again, so we set _flush_requested to avoid exactly that.

fixes this issue - can I close this?
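
For illustration, here is a minimal toy model of the teardown ordering that commit describes - plain C++, not the actual transport/server.cc code; _pending_responders and _flush_requested follow the commit message, everything else is invented for the sketch:

#include <cstdio>
#include <unordered_set>

struct connection;
static std::unordered_set<connection*> pending_responders;  // the poller's work list

struct connection {
    bool flush_requested = false;

    void maybe_enqueue() {
        // A new response would normally re-add the connection to the poller;
        // the flag introduced by the fix suppresses that during teardown.
        if (!flush_requested) {
            pending_responders.insert(this);
        }
    }

    void shutdown() {
        // Fix: detach from the poller and block re-insertion first...
        pending_responders.erase(this);
        flush_requested = true;
        // ...and only then wait for _ready_to_respond (a no-op in this model),
        // so the poller can no longer chain a second continuation onto it.
        std::puts("waiting for outstanding responses; poller can no longer interfere");
    }
};

int main() {
    connection c;
    pending_responders.insert(&c);
    c.shutdown();
    c.maybe_enqueue();   // a late response attempt is ignored
    std::printf("pending responders: %zu\n", pending_responders.size());
}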

@slivne
Contributor

slivne commented Oct 7, 2015

@gleb-cloudius, @pdziepak ping

@gleb-cloudius
Contributor

On Wed, Oct 07, 2015 at 06:41:20AM -0700, slivne wrote:

> @gleb-cloudius, @pdziepak ping

I have no idea. Let's close and re-open if somebody encounters it again.

        Gleb.

@slivne slivne closed this as completed Oct 7, 2015