
Coredump when using latest nightly rustc #37105

Closed
BusyJay opened this issue Oct 12, 2016 · 39 comments
Labels
C-bug Category: This is a bug. I-crash Issue: The compiler crashes (SIGSEGV, SIGABRT, etc). Use I-ICE instead when the compiler panics. P-low Low priority regression-from-stable-to-stable Performance or correctness regression from one stable version to another. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@BusyJay

BusyJay commented Oct 12, 2016

Hi, we recently upgraded our rustc compiler to the latest nightly version, but the compiled binary core dumps quickly under stress tests. A few stack traces can be found in tikv/tikv#1144. When we downgrade to rustc 1.12.0-nightly (b30eff7ba 2016-08-05), the binary works just fine.

The stacks look weird: the segmentation fault happens in liballoc, but we don't manage memory ourselves. We suspect there is a problem in some version later than rustc 1.12.0-nightly (b30eff7ba 2016-08-05). Could you please help us check it out? Thanks!

@sfackler sfackler added I-crash Issue: The compiler crashes (SIGSEGV, SIGABRT, etc). Use I-ICE instead when the compiler panics. regression-from-stable-to-nightly Performance or correctness regression from stable to nightly. labels Oct 12, 2016
@alexcrichton
Member

Thanks for the report @BusyJay! Is there a set of steps we can use to reproduce the crash locally? Also, is this on Linux?

@BusyJay
Author

BusyJay commented Oct 12, 2016

Yes, it's on Linux. Reproducing the crash takes a few steps; we will provide a demo project later.

@siddontang

siddontang commented Oct 14, 2016

Hi @alexcrichton

The steps to reproduce the crash are not simple:

Download official binaries

wget http://download.pingcap.org/tidb-latest-linux-amd64.tar.gz
tar -xzf tidb-latest-linux-amd64.tar.gz
cd tidb-latest-linux-amd64

You can use the tidb-server and pd-server binaries in the bin directory.

Clone TiKV and build

git clone https://github.com/pingcap/tikv.git tikv 
cd tikv
make static_release 
# If you have already installed RocksDB (version > 4.12), you can run `make` directly.

The tikv-server binary is installed in the bin directory.

Run PD

./bin/pd-server --cluster-id=1 --data-dir=./var/pd

pd-server will listen on ports 2379 and 2380.

Run TiKV

./bin/tikv-server -I 1 -S raftkv --pd 127.0.0.1:2379 -s ./var/tikv

tikv-server will listen on port 20160.

Run TiDB

./bin/tidb-server --store=tikv --path="127.0.0.1:2379?cluster=1" 

tidb-server will listen on port 4000.

Run Test

Unzip the bank test (bank.zip) and run

./bank 

Then wait a long time...

If you have any problem, please tell me.

@brson brson added the T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. label Oct 20, 2016
@eddyb
Member

eddyb commented Oct 20, 2016

(Shot in the dark) Does this reproduce on rustc 1.12.0-nightly (b30eff7ba 2016-08-05) when built with the environment variable RUSTFLAGS set to -Zorbit?
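I.e. something like this (a sketch; assuming TiKV's Makefile passes RUSTFLAGS through to cargo):

RUSTFLAGS="-Zorbit" make static_release   # explicitly enable MIR trans on the old nightly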

@brson
Contributor

brson commented Oct 20, 2016

@siddontang Is there any way this crash can be reduced to a smaller test case? It's going to be quite tricky to track down otherwise. Do you have instructions for reproducing the whole thing from source, without a binary download? If parts of it are not open source we may be able to arrange to debug it privately, but in any case a smaller test case would help immensely.

@TimNN
Contributor

TimNN commented Oct 20, 2016

@eddyb: Doesn't that nightly already default to orbit? As far as I can tell the switch happened between nightly-2016-08-02 and nightly-2016-08-03.

@eddyb
Member

eddyb commented Oct 20, 2016

@TimNN Ah you are indeed correct. Well, that's one down, 99 potential causes to go.

@eddyb
Member

eddyb commented Oct 20, 2016

Everyone involved in this thread: if you haven't already, check out http://rr-project.org/.
Reverse debugging might be the only affordable way to nail this bug down.
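A minimal sketch of the workflow (the server flags are copied from the repro steps above):

rr record ./bin/tikv-server -I 1 -S raftkv --pd 127.0.0.1:2379 -s ./var/tikv   # record an execution
rr replay   # replay it deterministically under gdb
# inside gdb: continue to the crash, then reverse-continue / reverse-step to walk backwards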

@siddontang

Hi @brson

We tried to reproduce this coredump with a simple test, but failed. 😭
You can't use TiKV alone; it must run together with PD and TiDB.

PD and TiDB are both written in Go and are open source under Apache-2.0, so you can use the pd-server and tidb-server binaries freely.

If you want to build them yourself, you must install Go 1.6+ first (https://golang.org/doc/install), then:

git clone https://github.com/pingcap/tidb.git $GOPATH/src/github.com/pingcap/tidb
cd $GOPATH/src/github.com/pingcap/tidb
make 
# the tidb-server is installed in $GOPATH/src/github.com/pingcap/tidb/bin directory 

git clone https://github.com/pingcap/pd.git $GOPATH/src/github.com/pingcap/pd
cd $GOPATH/src/github.com/pingcap/pd
make
# the pd-server is installed in $GOPATH/src/github.com/pingcap/pd/bin directory 

@pnkfelix
Member

@siddontang what is the source of the bank program? The zip file only has an executable.

@pnkfelix
Member

@siddontang I cannot reproduce the scenario; when I try to run the bank program, I get the following output:

https://gist.github.com/pnkfelix/4c87f20badee2c5110c23005984830cd

and the tidb-server terminates with the output:

2016/10/28 18:55:27 server.go:167: [error] accept error accept tcp [::]:4000: accept4: too many open files
2016/10/28 18:55:28 main.go:145: [error] accept tcp [::]:4000: accept4: too many open files

(The other two services keep running...)

I second @eddyb's suggestion of trying to use rr to look into this on your end.

@siddontang

Hi @pnkfelix
Sorry, I forgot to give you the bank source; it is here: https://gist.github.com/siddontang/2fb9cd4cf736199b0b017f7263c41ab4

In the bank case, at least `concurrency` (default: 10000) connections will be created.
It seems you should run `ulimit -n 10240` to make sure that many connections can be opened :-)

Of course, you can pass a different concurrency to bank, e.g. ./bank --concurrency 2000, which will only use 2000 connections.
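Putting it together, a full run might look like this (just an illustration; pick values that fit your machine):

ulimit -n 10240             # raise the open-file limit for this shell first
./bank --concurrency 2000   # then run the test with a matching concurrency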

@brson
Contributor

brson commented Nov 3, 2016

Thanks for the new info @siddontang. @pnkfelix maybe you can try again?

@pnkfelix pnkfelix self-assigned this Nov 3, 2016
@siddontang

Hi @pnkfelix, can you reproduce it?

@brson
Contributor

brson commented Nov 17, 2016

@rust-lang/compiler this needs a P- tag.

@brson brson added regression-from-stable-to-beta Performance or correctness regression from stable to beta. and removed regression-from-stable-to-nightly Performance or correctness regression from stable to nightly. labels Nov 17, 2016
@nikomatsakis
Contributor

triage: P-high

In particular, we should figure out if we can reproduce this or not!

@rust-highfive rust-highfive added P-high High priority and removed I-nominated labels Nov 17, 2016
@arielb1
Contributor

arielb1 commented Nov 17, 2016

@BusyJay

If you could reproduce the crash under rr, that would help us a lot.

@BusyJay
Author

BusyJay commented Nov 18, 2016

I just tested with rr, but got an unexpected error. I will retry once it's resolved.

@sanxiyn
Member

sanxiyn commented Nov 19, 2016

The rr issue is resolved.

@BusyJay
Author

BusyJay commented Nov 19, 2016

Yes, and I have been testing for more than 12 hours, but it still hasn't crashed. I guess rr emulating a single-core machine makes it very slow.
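Perhaps rr's chaos mode would make the race more likely to fire; a sketch of what I can try next (assuming the rr version we use supports --chaos):

rr record --chaos ./bin/tikv-server -I 1 -S raftkv --pd 127.0.0.1:2379 -s ./var/tikv   # randomize scheduling while recording
rr replay   # replay the recording under gdb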

@brson
Contributor

brson commented Dec 1, 2016

It's looking like we're unlikely to solve this before release.

@nikomatsakis
Contributor

Discussed in the @rust-lang/compiler meeting. We're basically still having trouble reproducing the bug (ideally under rr). No real status update.

@brson
Contributor

brson commented Dec 15, 2016

@pnkfelix maybe we can reduce this to P-medium until it's clear there's a bug to tackle on the rust side here.

@nikomatsakis
Contributor

triage: P-medium

Since we have not been able to reproduce this, and we've basically stalled out, we're going to downgrade its priority.

@BusyJay, please let us know current status (is this still reproducing for you outside rr?) and if there is anything we can do to help track it down.

@rust-highfive rust-highfive added P-medium Medium priority and removed P-high High priority labels Dec 22, 2016
@nikomatsakis
Contributor

Is this now stable-to-stable?

@siddontang

Hi @nikomatsakis

Sorry, we haven't had enough time to reproduce it. We tested three weeks ago and the issue still existed. We can test it again with the newest nightly version :-)

@nikomatsakis
Contributor

@siddontang

Sorry, we haven't had enough time to reproduce it.

I'm sorry that we can't reproduce it either. =( Please do give it another try so at least we know if it is still a problem!

@nikomatsakis
Contributor

Have we at least narrowed this down to a specific nightly where the problem seems to occur? It seems like it is not due to the switch to MIR, right?

@brson brson added regression-from-stable-to-stable Performance or correctness regression from one stable version to another. and removed regression-from-stable-to-beta Performance or correctness regression from stable to beta. labels Dec 29, 2016
@DemiMarie
Contributor

Could this be some sort of data race? Being unable to reproduce it under rr sounds like a race condition.

Can you try ThreadSanitizer?
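A sketch of how that could be wired up (assuming a recent nightly toolchain; -Z sanitizer is unstable, and the standard library may need to be rebuilt with the sanitizer as well):

RUSTFLAGS="-Z sanitizer=thread" cargo build --target x86_64-unknown-linux-gnu   # nightly-only flag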

@siddontang

@nikomatsakis

We did some tests and can't reproduce it with the newest Rust + the newest TiKV, but to our surprise, we can reproduce it with the newest Rust + an old TiKV (the 2016-08-21 version). Our Rust version is:

rustc -V
rustc 1.14.0 (e8a012324 2016-12-16)

We don't know why yet; maybe the changes in TiKV sidestep the trigger condition for the core dump, or maybe the problem still exists and we just haven't hit it.

For now we have decided to use the newest Rust for TiKV, and if we hit this core dump again, we will update the issue.

@nikomatsakis
Contributor

@siddontang ok, well, I'm glad you're not hitting the issue anymore, but I wish we had a better handle on what the problem is exactly. Of course, it is also possible that the bug is in fact in TiKV (or some other package featuring unsafe code), so it's quite likely that the problem is indeed fixed by a newer version.

@nikomatsakis
Contributor

I'm going to downgrade to P-low until we have more data.

@nikomatsakis
Contributor

triage: P-low

@rust-highfive rust-highfive added P-low Low priority and removed P-medium Medium priority labels Jan 2, 2017
@siddontang

Hi @nikomatsakis

Sadly, we hit the coredump again with the newest rustc + the newest TiKV. 😭

We will try to reproduce it again.

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `bin/tikv-server --addr 0.0.0.0:10160 --advertise-addr 10.2.0.91:10160 --pd 10.2'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000559d7ec9f319 in je_arena_sdalloc (tcache=0x7ffb315ab000, size=114688, ptr=0x559d7f0d0748 <vtable.4C>, tsd=<optimized out>)
    at /buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/src/liballoc_jemalloc/../jemalloc/include/jemalloc/internal/arena.h:1439
1439	/buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/src/liballoc_jemalloc/../jemalloc/include/jemalloc/internal/arena.h: No such file or directory.
warning: Missing auto-load scripts referenced in section .debug_gdb_scripts
of file /home/work/deploy/bin/tikv-server
Use `info auto-load python-scripts [REGEXP]' to list them.
(gdb) bt
#0  0x0000559d7ec9f319 in je_arena_sdalloc (tcache=0x7ffb315ab000, size=114688, ptr=0x559d7f0d0748 <vtable.4C>, tsd=<optimized out>)
    at /buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/src/liballoc_jemalloc/../jemalloc/include/jemalloc/internal/arena.h:1439
#1  je_isdalloct (tcache=0x7ffb315ab000, size=114688, ptr=0x559d7f0d0748 <vtable.4C>, tsd=<optimized out>) at include/jemalloc/internal/jemalloc_internal.h:1087
#2  je_isqalloc (tcache=0x7ffb315ab000, size=114688, ptr=0x559d7f0d0748 <vtable.4C>, tsd=<optimized out>) at include/jemalloc/internal/jemalloc_internal.h:1097
#3  isfree (tcache=0x7ffb315ab000, usize=114688, ptr=0x559d7f0d0748 <vtable.4C>, tsd=<optimized out>)
    at /buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/src/liballoc_jemalloc/../jemalloc/src/jemalloc.c:1842
#4  sdallocx (ptr=0x559d7f0d0748 <vtable.4C>, size=<optimized out>, flags=<optimized out>)
    at /buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/src/liballoc_jemalloc/../jemalloc/src/jemalloc.c:2532
#5  0x0000559d7e75ee7f in deallocate (ptr=0x559d7f0d0748 <vtable.4C> "p\267]~\235U", old_size=112858, align=1)
    at /buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/src/liballoc/heap.rs:113
#6  drop<u8> (self=<optimized out>) at /buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/src/liballoc/raw_vec.rs:552
#7  tikv::server::coprocessor::endpoint::{{impl}}::run_batch (self=0x7ffb155fdce0, tasks=<optimized out>) at /home/pingcap/cwen/tikv/src/server/coprocessor/endpoint.rs:183
#8  0x0000559d7e58e78a in poll<tikv::server::coprocessor::endpoint::Host,tikv::server::coprocessor::endpoint::Task> (batch_size=50, runner=..., rx=..., counter=...)
    at /home/pingcap/cwen/tikv/src/util/worker/mod.rs:161
#9  {{closure}}<tikv::server::coprocessor::endpoint::Task,tikv::server::coprocessor::endpoint::Host> () at /home/pingcap/cwen/tikv/src/util/worker/mod.rs:196
#10 call_once<(),closure> (self=..., _args=<optimized out>) at /buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/src/libstd/panic.rs:295

@nikomatsakis
Contributor

@siddontang argh, sorry to hear that. :(

@siddontang

Hi @nikomatsakis @pnkfelix

A strange update: after we merged tikv/tikv#1512, we found that the newest Rust is now OK. We have run many tests for a long time and the core dump doesn't happen, so we guess this PR fixes the problem, but we don't know why. Could you help us find the reason?

We used nightly-2016-08-06 before, so I think the bug was introduced after that version.
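To narrow it down, one could bisect the nightlies with rustup (a sketch; assuming rustup is installed, and the dates are only illustrations):

rustup toolchain install nightly-2016-08-07            # install a candidate dated nightly
rustup run nightly-2016-08-07 cargo build --release    # rebuild TiKV with it, then rerun the stress test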

@brson
Contributor

brson commented Feb 9, 2017

Thanks for the continued investigation and updates, @siddontang.

@Mark-Simulacrum Mark-Simulacrum added the C-bug Category: This is a bug. label Jul 26, 2017
@pnkfelix
Member

Unassigning self.

I'm not sure we can reasonably expect to determine the underlying problem that has either been fixed or masked, since as far as I can tell, no one working on the rustc compiler has locally reproduced the problem.

@steveklabnik
Member

It's been over a year since any update, and almost two years since this issue was originally reported, so I'm going to go ahead and close this. @BusyJay, if you have a way to reproduce this and still care about it, please let me know!
