Coredump when using latest nightly rustc #37105

Closed
BusyJay opened this Issue Oct 12, 2016 · 39 comments

BusyJay commented Oct 12, 2016

Hi, we recently upgraded our rustc compiler to the latest nightly version, but the compiled binary core dumps quickly under stress tests. A few stack traces can be found in tikv/tikv#1144. When we downgrade rustc to rustc 1.12.0-nightly (b30eff7ba 2016-08-05), the binary works just fine.

The stack traces look weird: the segmentation fault happens in liballoc, but we don't manage memory ourselves. Our guess is that a problem was introduced in some version later than rustc 1.12.0-nightly (b30eff7ba 2016-08-05). Could you please help us check it out? Thanks!

Member

alexcrichton commented Oct 12, 2016

Thanks for the report @BusyJay! Is there also a set of steps to reproduce the crash you're seeing as well locally? Also, is this on Linux?

Author

BusyJay commented Oct 12, 2016

Yes, it's on Linux. It takes a few steps to reproduce the crash; we will provide them via a demo project later.

siddontang commented Oct 14, 2016

Hi @alexcrichton

The steps to reproduce the crash may not be easy:

Download official binaries

wget http://download.pingcap.org/tidb-latest-linux-amd64.tar.gz
tar -xzf tidb-latest-linux-amd64.tar.gz
cd tidb-latest-linux-amd64

You can use tidb-server and pd-server in bin directory.

Clone TiKV and build

git clone https://github.com/pingcap/tikv.git tikv 
cd tikv
make static_release 
// If you have already installed RocksDB (ver > 4.12), you can use `make` directly.

The tikv-server is installed in bin directory.

Run PD

./bin/pd-server --cluster-id=1 --data-dir=./var/pd

pd-server will listen on ports 2379 and 2380.

Run TiKV

./bin/tikv-server -I 1 -S raftkv --pd 127.0.0.1:2379 -s ./var/tikv

tikv-server will listen on port 20160.

Run TiDB

./bin/tidb-server --store=tikv --path="127.0.0.1:2379?cluster=1" 

tidb-server will listen on port 4000.

Run Test

Unzip the bank test (bank.zip) and run:

./bank 

Then wait a long time......

If you run into any problems, please tell me.

brson added the T-compiler label Oct 20, 2016

Member

eddyb commented Oct 20, 2016

(Shot in the dark) Does this reproduce on rustc 1.12.0-nightly (b30eff7ba 2016-08-05) when built with the environment variable RUSTFLAGS set to -Zorbit?

Contributor

brson commented Oct 20, 2016

@siddontang Is there any way this crash can be reduced to a smaller test case? It's going to be quite tricky to track down otherwise. Do you have instructions for reproducing the whole thing from source, without a binary download? If parts of it are not open source we may be able to arrange to debug it privately, but in any case a smaller test case would help immensely.

Contributor

TimNN commented Oct 20, 2016

@eddyb: Doesn't that nightly already default to orbit? As far as I can tell the switch happened between nightly-2016-08-02 and nightly-2016-08-03.

Member

eddyb commented Oct 20, 2016

@TimNN Ah you are indeed correct. Well, that's one down, 99 potential causes to go.

Member

eddyb commented Oct 20, 2016

Everyone involved in this thread: if you haven't already, check out http://rr-project.org/.
Reverse debugging might be the only affordable way to nail this bug down.
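
For readers who have not used rr, a minimal sketch of a record-and-replay session follows. The tikv-server command line is copied from the reproduction steps earlier in this thread; the TIKV_CMD variable and the guard around the rr calls are illustrative assumptions, not something the thread participants ran.

```shell
#!/bin/sh
# Sketch only: flags copied from the "Run TiKV" step above; adjust for your setup.
TIKV_CMD="./bin/tikv-server -I 1 -S raftkv --pd 127.0.0.1:2379 -s ./var/tikv"

if command -v rr >/dev/null 2>&1 && [ -x ./bin/tikv-server ]; then
    # Record the whole run; rr saves a trace that can be replayed deterministically.
    # Word splitting of $TIKV_CMD is intentional here.
    rr record $TIKV_CMD
    # Replay the most recent trace in a gdb session. Inside it, gdb's
    # reverse-continue and reverse-step walk backwards from the crash.
    rr replay
else
    echo "rr or ./bin/tikv-server not found; nothing recorded"
fi
```

Note that rr runs all recorded threads on a single core, so a heavily multi-threaded server will run much slower than normal and timing-dependent bugs may take longer to appear. rr also offers a chaos mode (rr record --chaos, assuming your rr version supports it) that randomizes scheduling to make such bugs more likely to show up.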

siddontang commented Oct 22, 2016

Hi @brson

We tried to reproduce this core dump with a simple test, but failed. 😭
TiKV can't be used alone; it must run together with PD and TiDB.

PD and TiDB are both written in Go and are open source under the Apache 2.0 license, so you can use the pd-server and tidb-server binaries freely.

If you want to build them yourself, you must install Go 1.6+ first (https://golang.org/doc/install), then:

git clone https://github.com/pingcap/tidb.git $GOPATH/src/github.com/pingcap/tidb
cd $GOPATH/src/github.com/pingcap/tidb
make 
# the tidb-server is installed in $GOPATH/src/github.com/pingcap/tidb/bin directory 

git clone https://github.com/pingcap/pd.git $GOPATH/src/github.com/pingcap/pd
cd $GOPATH/src/github.com/pingcap/pd
make
# the pd-server is installed in $GOPATH/src/github.com/pingcap/pd/bin directory 

Member

pnkfelix commented Oct 28, 2016

@siddontang what is the source to the bank program? The zip file only has an executable.

Member

pnkfelix commented Oct 28, 2016

@siddontang I cannot reproduce the scenario; when I try to run the bank program, I get the following output:

https://gist.github.com/pnkfelix/4c87f20badee2c5110c23005984830cd

and the tidb-server terminates with the output:

2016/10/28 18:55:27 server.go:167: [error] accept error accept tcp [::]:4000: accept4: too many open files
2016/10/28 18:55:28 main.go:145: [error] accept tcp [::]:4000: accept4: too many open files

(The other two services keep running...)

I second @eddyb 's suggestion of trying to use rr to look into this on your end.

siddontang commented Oct 29, 2016

Hi @pnkfelix
Sorry, I forgot to give you the bank source. It is here: https://gist.github.com/siddontang/2fb9cd4cf736199b0b017f7263c41ab4

In the bank test, we create at least concurrency connections (the default is 10000).
It seems you need to run ulimit -n 10240 first so that this many connections can be opened :-)

Of course, you can pass a different concurrency to bank, e.g. ./bank --concurrency 2000, which will only use 2000 connections.
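
As a rough illustration, the check below compares the shell's soft open-file limit with the number of connections the bank test is expected to open. The fd_limit_ok helper and the needed variable are hypothetical names introduced for this sketch; the 10000 default and the 10240 suggestion come from the comment above. The limit has to be raised in the shell that starts tidb-server as well, since that is the process that reported "too many open files".

```shell
#!/bin/sh
# Hypothetical helper: succeeds when the open-file limit ($1) is larger than
# the number of descriptors needed ($2). "unlimited" always passes.
fd_limit_ok() {
    [ "$1" = "unlimited" ] && return 0
    [ "$1" -gt "$2" ]
}

needed="${1:-10000}"      # bank's default concurrency, per the comment above
current="$(ulimit -n)"    # soft limit of the current shell

if fd_limit_ok "$current" "$needed"; then
    echo "ok: open-file limit $current is above $needed"
else
    echo "too low: open-file limit is $current, need more than $needed (try: ulimit -n 10240)"
fi
```

The comparison is strict because each process also needs a few descriptors beyond its client connections (log files, listening sockets and so on), so a limit exactly equal to the concurrency would still be too small.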

Contributor

brson commented Nov 3, 2016

Thanks for the new info @siddontang. @pnkfelix maybe you can try again?

pnkfelix self-assigned this Nov 3, 2016

siddontang commented Nov 11, 2016

Hi @pnkfelix, can you reproduce it?

brson added the I-nominated label Nov 17, 2016

Contributor

brson commented Nov 17, 2016

@rust-lang/compiler this needs a P- tag.

Contributor

nikomatsakis commented Nov 17, 2016

triage: P-high

In particular, we should figure out if we can reproduce this or not!

rust-highfive added P-high and removed I-nominated labels Nov 17, 2016

Contributor

arielb1 commented Nov 17, 2016

@BusyJay

If you could reproduce the crash under rr, that would help us a lot.

Author

BusyJay commented Nov 18, 2016

Just tested with rr, but got an unexpected error. I will retry it later once it's resolved.

Member

sanxiyn commented Nov 19, 2016

The rr issue is resolved.

Author

BusyJay commented Nov 19, 2016

Yes, and I have been testing it for more than 12 hours, and it still hasn't crashed. I guess that rr emulating a single-core machine just makes it very slow.

Contributor

brson commented Dec 1, 2016

It's looking like we're unlikely to solve this before release.

Contributor

nikomatsakis commented Dec 8, 2016

In @rust-lang/compiler meeting, discussed. We're basically still having trouble reproducing the bug (ideally in rr). No real status update.

Contributor

brson commented Dec 15, 2016

@pnkfelix maybe we can reduce this to P-medium until it's clear there's a bug to tackle on the rust side here.

Contributor

nikomatsakis commented Dec 22, 2016

triage: P-medium

Seeing as how we have not been able to reproduce, and we've basically stalled out, we're going to downgrade this in priority.

@BusyJay, please let us know current status (is this still reproducing for you outside rr?) and if there is anything we can do to help track it down.

rust-highfive added P-medium and removed P-high labels Dec 22, 2016

Contributor

nikomatsakis commented Dec 22, 2016

Is this now stable-to-stable?

siddontang commented Dec 23, 2016

Hi @nikomatsakis

Sorry that we don't have enough time to reproduce it. We tested it three weeks ago and this issue still existed. We can test it again after you release the newest nightly version :-)

Contributor

nikomatsakis commented Dec 27, 2016

@siddontang

> Sorry that we don't have enough time to reproduce it.

I'm sorry that we can't reproduce it either. =( Please do give it another try so at least we know if it is still a problem!

Contributor

nikomatsakis commented Dec 27, 2016

Have we at least narrowed this down to a specific nightly where the problem seems to occur? It seems like it is not due to the switch to MIR, right?

Contributor

DemiMarie commented Dec 31, 2016

Could this be some sort of data race? Being unable to reproduce it under rr sounds like a race condition.

Can you try ThreadSanitizer?

siddontang commented Jan 2, 2017

@nikomatsakis

We did some testing and can't reproduce it with the newest Rust + newest TiKV, but to our surprise, we can reproduce it with the newest Rust + an old TiKV (the 2016-08-21 version). Our Rust version is:

rustc -V
rustc 1.14.0 (e8a012324 2016-12-16)

We don't know why yet. Maybe the changes in TiKV avoid the condition that triggers the core dump, or the problem still exists and we just haven't hit it.

For now we have decided to use the newest Rust for TiKV. If we hit this core dump again later, we will update the issue.

Contributor

nikomatsakis commented Jan 2, 2017

@siddontang ok, well, I'm glad you're not hitting the issue anymore, but I wish we had a better handle on what the problem is exactly. Of course, it is also possible that the bug is in fact in TiKV (or some other package featuring unsafe code), so it's quite likely that the problem is indeed fixed by a newer version.

Contributor

nikomatsakis commented Jan 2, 2017

I'm going to downgrade to P-low until we have more data.

Contributor

nikomatsakis commented Jan 2, 2017

triage: P-low

rust-highfive added P-low and removed P-medium labels Jan 2, 2017

siddontang commented Jan 8, 2017

Hi @nikomatsakis

Sadly, we hit the core dump again with the newest rustc + newest TiKV. 😭

We will try to reproduce it again.

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `bin/tikv-server --addr 0.0.0.0:10160 --advertise-addr 10.2.0.91:10160 --pd 10.2'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000559d7ec9f319 in je_arena_sdalloc (tcache=0x7ffb315ab000, size=114688, ptr=0x559d7f0d0748 <vtable.4C>, tsd=<optimized out>)
    at /buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/src/liballoc_jemalloc/../jemalloc/include/jemalloc/internal/arena.h:1439
1439	/buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/src/liballoc_jemalloc/../jemalloc/include/jemalloc/internal/arena.h: No such file or directory.
warning: Missing auto-load scripts referenced in section .debug_gdb_scripts
of file /home/work/deploy/bin/tikv-server
Use `info auto-load python-scripts [REGEXP]' to list them.
(gdb) bt
#0  0x0000559d7ec9f319 in je_arena_sdalloc (tcache=0x7ffb315ab000, size=114688, ptr=0x559d7f0d0748 <vtable.4C>, tsd=<optimized out>)
    at /buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/src/liballoc_jemalloc/../jemalloc/include/jemalloc/internal/arena.h:1439
#1  je_isdalloct (tcache=0x7ffb315ab000, size=114688, ptr=0x559d7f0d0748 <vtable.4C>, tsd=<optimized out>) at include/jemalloc/internal/jemalloc_internal.h:1087
#2  je_isqalloc (tcache=0x7ffb315ab000, size=114688, ptr=0x559d7f0d0748 <vtable.4C>, tsd=<optimized out>) at include/jemalloc/internal/jemalloc_internal.h:1097
#3  isfree (tcache=0x7ffb315ab000, usize=114688, ptr=0x559d7f0d0748 <vtable.4C>, tsd=<optimized out>)
    at /buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/src/liballoc_jemalloc/../jemalloc/src/jemalloc.c:1842
#4  sdallocx (ptr=0x559d7f0d0748 <vtable.4C>, size=<optimized out>, flags=<optimized out>)
    at /buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/src/liballoc_jemalloc/../jemalloc/src/jemalloc.c:2532
#5  0x0000559d7e75ee7f in deallocate (ptr=0x559d7f0d0748 <vtable.4C> "p\267]~\235U", old_size=112858, align=1)
    at /buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/src/liballoc/heap.rs:113
#6  drop<u8> (self=<optimized out>) at /buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/src/liballoc/raw_vec.rs:552
#7  tikv::server::coprocessor::endpoint::{{impl}}::run_batch (self=0x7ffb155fdce0, tasks=<optimized out>) at /home/pingcap/cwen/tikv/src/server/coprocessor/endpoint.rs:183
#8  0x0000559d7e58e78a in poll<tikv::server::coprocessor::endpoint::Host,tikv::server::coprocessor::endpoint::Task> (batch_size=50, runner=..., rx=..., counter=...)
    at /home/pingcap/cwen/tikv/src/util/worker/mod.rs:161
#9  {{closure}}<tikv::server::coprocessor::endpoint::Task,tikv::server::coprocessor::endpoint::Host> () at /home/pingcap/cwen/tikv/src/util/worker/mod.rs:196
#10 call_once<(),closure> (self=..., _args=<optimized out>) at /buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/src/libstd/panic.rs:295

Contributor

nikomatsakis commented Jan 9, 2017

@siddontang argh, sorry to hear that. :(

siddontang commented Feb 8, 2017

Hi @nikomatsakis @pnkfelix

A strange update: after we merged tikv/tikv#1512, the newest Rust works fine. We ran many tests for a long time and the core dump didn't happen, so we guess this PR fixes the problem, but we don't know why. Could you help us find the reason?

We used nightly-2016-08-06 before, so I think the bug was introduced after this version.

Contributor

brson commented Feb 9, 2017

Thanks for the continued investigation and update, @siddontang.

Member

pnkfelix commented Aug 31, 2017

unassigning self.

I'm not sure we can reasonably expect to determine the underlying problem that has either been fixed or masked, since as far as I can tell, no one working on the rustc compiler has locally reproduced the problem.

Member

steveklabnik commented Mar 4, 2019

It’s been over a year since any update, and almost two years since this issue was originally reported. I’m going to go ahead and close this. @BusyJay, if you have a way to reproduce and still care about this, please let me know!
