A 3-node cluster of YugaByte doesn't tolerate isolation of a single node #19

Closed
rystsov opened this issue Jan 24, 2018 · 8 comments
Labels: kind/bug (This issue is a bug)

rystsov commented Jan 24, 2018

Hi folks,

I was testing how a 3-node cluster of YugaByte behaves (from a client's perspective) when one of the nodes is isolated from its peers, and I observed an anomaly: the whole system became unavailable until the connection was restored.

Repo with Docker+JavaScript scripts to reproduce the problem: https://github.com/rystsov/perseus/tree/master/yugabyte

Testing environment

OS: Ubuntu 17.10
YugaByte: v0.9.1.0
NodeJS: v8.9.4
"redis" npm driver: 2.8.0

A client runs three coroutines; each connects to a corresponding YugaByte node (yuga1, yuga2, and yuga3) and, in a loop:

  1. Reads a value from a db
  2. Increments it
  3. Writes it back

Each coroutine uses a different key to avoid collisions.
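
For reference, one node's loop looks roughly like this (a minimal sketch using the same "redis" npm driver; the host, key, and stats names are illustrative, not copied from the perseus scripts):

```
// Minimal sketch of a single coroutine's read-increment-write loop against
// one YugaByte node's Redis endpoint. Hypothetical names; the real scripts
// live in the perseus repo linked above.
const redis = require('redis');
const { promisify } = require('util');

async function incrementLoop(host, key, stats) {
  const client = redis.createClient({ host, port: 6379 });
  const get = promisify(client.get).bind(client);
  const set = promisify(client.set).bind(client);

  for (;;) {
    try {
      const value = parseInt(await get(key), 10) || 0; // 1. read
      await set(key, String(value + 1));               // 2. increment, 3. write back
      stats.ok += 1;                                   // feeds the per-second counters
    } catch (err) {
      stats.err += 1;                                  // feeds the :err columns
    }
  }
}
```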

The client aggregates the number of successful iterations and dumps it every second:

#legend: time|yuga1|yuga2|yuga3|yuga1:err|yuga2:err|yuga3:err
1	70	68	118	0	0	0	2018/01/24 04:27:40
2	82	85	141	0	0	0	2018/01/24 04:27:41

The first column is the number of seconds elapsed since the beginning of the experiment; the next three columns contain the count of successful iterations during the last second for the client connected to each node; the following three columns are the corresponding error counts; and the last column is the wall-clock time.

After I isolated the yuga3 node, the system became unavailable for at least 366 seconds:

#legend: time|yuga1|yuga2|yuga3|yuga1:err|yuga2:err|yuga3:err
1	70	68	118	0	0	0	2018/01/24 04:27:40
2	82	85	141	0	0	0	2018/01/24 04:27:41
...
12	82	68	153	0	0	0	2018/01/24 04:27:51
13	93	81	159	0	0	0	2018/01/24 04:27:52
# isolating yuga3
# isolated yuga3
14	78	78	124	0	0	0	2018/01/24 04:27:53
15	0	0	0	0	0	0	2018/01/24 04:27:54
...
380	0	0	0	0	0	0	2018/01/24 04:34:00
381	0	0	0	0	0	0	2018/01/24 04:34:01
# rejoining yuga3
# rejoined yuga3
382	0	0	0	0	0	0	2018/01/24 04:34:02
383	0	0	0	0	0	0	2018/01/24 04:34:03
...
432	0	0	0	0	0	0	2018/01/24 04:34:52
433	0	0	0	0	0	0	2018/01/24 04:34:53
434	118	111	77	1	1	1	2018/01/24 04:34:54
435	219	166	80	0	0	0	2018/01/24 04:34:55

After I reestablished the connection, the system recovered in 51 seconds.

The logs: dead.tar.gz

kmuthukk assigned kmuthukk and amitanandaiyer and unassigned kmuthukk Jan 24, 2018
kmuthukk (Collaborator) commented Jan 24, 2018

Hi @amitanandaiyer - assigning this to you for an initial look.

Thanks @rystsov for reporting the issue (with clear steps for repro). Much appreciated.

amitanandaiyer (Contributor) commented:

Hi Denis,
Thanks for reporting this issue. I'm trying to reproduce what's happening on your end, but I haven't been able to get a similar result.

A few questions regarding the repro:

  1. Was this something you were able to see consistently, or something you stumbled upon across various tries?

  2. If you are able to repro consistently, would you mind adding the --v=3 option to the db-yb-tserver/master.sh files?

Thanks,
-Amit
Attachment: client1.log

rystsov (Author) commented Jan 26, 2018

I was able to reproduce it with roughly a 15% success rate. Maybe it depends on which node you isolate. I observed the issue twice, and both times I had isolated the node with the highest number of RPS:

12	82	68	153	0	0	0	2018/01/24 04:27:51
13	93	81	159	0	0	0	2018/01/24 04:27:52

In this case, it's the third node. In my other try, it was the second, as far as I remember.

Between the tries, I stopped the containers and removed them (docker rm yuga1 yuga2 yuga3) along with the logs to have a fresh start each time.

I'll try to add the --v=3 parameter in the evening. Just out of curiosity, what does it do?

amitanandaiyer (Contributor) commented:

--v=3 turns on verbose logging, up to level 3. Hoping to see what the redis proxy layer (the shim listening on port 6379 and converting requests to the YugaByte DB) is doing.

amitanandaiyer (Contributor) commented:

Thanks for the tip about stopping the containers and removing them. Will try doing that between retries.

amitanandaiyer (Contributor) commented:

I managed to partially repro the issue by setting '--yb_num_total_tablets 1' in enable-redis.sh and taking out the node that is the active leader.

I don't see all 3 nodes going to 0, but two nodes' ops drop to 0 instead of just the one node that was taken down.

Will analyze the logs and update my findings here.

rkarthik007 added this to To Do in Fault Tolerance via automation Jan 26, 2018
rkarthik007 moved this from To Do to In progress in Fault Tolerance Jan 26, 2018
rkarthik007 added the kind/bug label Jan 27, 2018
amitanandaiyer (Contributor) commented:

So this is what's happening.

Denis is writing to 3 rows, which can go to 3 tablets served by 3 leaders. Probabilistically, all 3 rows could end up being served by tablets on one node.

Let's say that yuga3 is the leader for all 3 tablets, and that yuga3 is taken down.

  1. At the Tserver/YB layer, the remaining peers detect that the leader is missing (in < 2 sec) and run a leader election to make progress.
  2. However, the Redis proxies on all 3 nodes -- which were previously sending their requests to yuga3 -- are still waiting on yuga3 for a response.

Due to the manner in which the failure is caused (iptables dropping packets vs. bringing yuga3 down), pre-established connections between the proxy and the tservers are not torn down. Requests wait out their timeout (default 60s) and return an error. However, since the deadline has passed by then, they do not attempt a retry, leaving the metacache still pointing to yuga3 and routing future requests to the same fate.

I have a patch that allows unsuccessful/timed-out operations to mark the leader as suspect. This would cause the 2 available proxies to notice the leader movement and get their ops back once the first RPC times out.
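
Conceptually, the patch amounts to evicting the cached leader entry when an operation times out, so the next request re-resolves the tablet leader. A toy illustration in plain JavaScript (hypothetical names; the actual client code is YugaByte's C++, not this):

```
// Toy model of the metacache fix: a timed-out op marks the cached leader as
// suspect, forcing a fresh leader lookup on the next request instead of
// routing it to the unreachable node again.
class MetaCache {
  constructor(resolveLeader) {
    this.resolveLeader = resolveLeader; // async: tablet -> current leader
    this.leaders = new Map();           // tablet -> cached leader
  }

  async leaderFor(tablet) {
    if (!this.leaders.has(tablet)) {
      this.leaders.set(tablet, await this.resolveLeader(tablet));
    }
    return this.leaders.get(tablet);
  }

  markSuspect(tablet) {
    // Before the fix the stale entry stayed put, so every future request
    // waited out the full timeout against the isolated node.
    this.leaders.delete(tablet);
  }
}

async function callTablet(cache, tablet, rpc, timeoutMs) {
  const leader = await cache.leaderFor(tablet);
  try {
    return await rpc(leader, timeoutMs);
  } catch (err) {
    cache.markSuspect(tablet); // timed out: re-resolve the leader next time
    throw err;
  }
}
```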

amitanandaiyer (Contributor) commented:

@rystsov This issue has been fixed in 2ee52b6

If you pick up the fix and try again, you should see that things recover within redis_service_yb_client_timeout_millis (which defaults to 60 sec).

To make connection-failure detection/recovery independent of the API timeout, I'm working on another diff to enable TCP keepalive on our connections. That is in the works and is currently being tested.
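
At the socket level, the keepalive mechanism is the standard TCP keepalive option. A Node sketch of the idea (illustrative only; YugaByte's change is in its C++ RPC layer, not in Node):

```
// TCP keepalive as a connection-death detector, independent of any
// application-level request timeout. If the peer silently drops packets
// (as iptables does here), keepalive probes eventually surface an error
// instead of leaving the request waiting out a 60s timeout.
const net = require('net');

const socket = net.connect({ host: 'yuga3', port: 6379 }, () => {
  socket.setKeepAlive(true, 10000); // start probing after 10s of idleness
});

socket.on('error', (err) => {
  console.error('connection declared dead:', err.message);
});
```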

Fault Tolerance automation moved this from In progress to Done Feb 9, 2018
mbautin pushed a commit that referenced this issue Jun 20, 2019
Add information on recent features.
Includes:
- Added info on instance tags
- AWS backup/restore page.
ttyusupov added a commit that referenced this issue Jul 16, 2021
Summary:
Some time ago we had optimizations disabled for the debug build type, but they were enabled during the fix of #1291: f710367 ( https://phabricator.dev.yugabyte.com/D6660 ). Now we no longer have the `retryable_rpc_single_call_timeout_ms` flag, and optimizations in the debug build make it harder to investigate issues because stack traces and variables are optimized out. So we can disable these optimizations again to make debugging easier.

Before (note <optimized out> values that are not available for debugging):
```
(gdb) bt
#0  0x00007f97990bfa6b in raise (sig=sig@entry=11) at ../sysdeps/unix/sysv/linux/pt-raise.c:35
#1  0x00007f97a45268b9 in AddHash (num_probes=6, total_bits=523776, num_lines=<optimized out>, data=0x2bdc000 "", h=4266458700) at ../../src/yb/rocksdb/util/bloom.cc:66
#2  rocksdb::(anonymous namespace)::FixedSizeFilterBitsBuilder::AddKey (this=<optimized out>, key=...) at ../../src/yb/rocksdb/util/bloom.cc:463
#3  0x00007f97a44f0818 in rocksdb::FixedSizeFilterBlockBuilder::AddKey (this=this@entry=0x1c87a40, key=...) at ../../src/yb/rocksdb/table/fixed_size_filter_block.cc:97
#4  0x00007f97a44f08b0 in rocksdb::FixedSizeFilterBlockBuilder::Add (this=0x1c87a40, key=...) at ../../src/yb/rocksdb/table/fixed_size_filter_block.cc:91
#5  0x00007f97a44cf295 in rocksdb::BlockBasedTableBuilder::Add (this=0x1cd1c00, key=..., value=...) at ../../src/yb/rocksdb/table/block_based_table_builder.cc:468
#6  0x00007f97a439b9a4 in rocksdb::BuildTable (dbname=..., env=0x7f97a48a2c00 <rocksdb::Env::Default()::default_env>, ioptions=..., env_options=..., table_cache=0x1ccf740, iter=0x7f97969858f8, meta=0x7f97969864a0, internal_comparator=
    std::shared_ptr<const rocksdb::InternalKeyComparator> (use count 3, weak count 0) = {...}, int_tbl_prop_collector_factories=std::vector of length 1, capacity 1 = {...}, column_family_id=0,
    snapshots=std::vector of length 0, capacity 0, earliest_write_conflict_snapshot=72057594037927935, compression=rocksdb::kSnappyCompression, compression_opts=..., paranoid_file_checks=false, internal_stats=0x1d40200,
    boundary_values_extractor=0x1cd03f0, io_priority=rocksdb::Env::IO_HIGH, table_properties=0x7f9796987000) at ../../src/yb/rocksdb/db/builder.cc:160
#7  0x00007f97a4444ced in rocksdb::FlushJob::WriteLevel0Table (this=this@entry=0x7f9796986f40, mems=..., edit=0x1d44268, meta=meta@entry=0x7f97969864a0) at ../../src/yb/rocksdb/db/flush_job.cc:290
#8  0x00007f97a444669c in rocksdb::FlushJob::Run (this=this@entry=0x7f9796986f40, file_meta=file_meta@entry=0x7f9796986d00) at ../../src/yb/rocksdb/db/flush_job.cc:191
#9  0x00007f97a43fb5ba in rocksdb::DBImpl::FlushMemTableToOutputFile (this=this@entry=0x1d24000, cfd=cfd@entry=0x1a7b000, mutable_cf_options=..., made_progress=made_progress@entry=0x7f9796987f47,
    job_context=job_context@entry=0x7f9796987d70, log_buffer=0x7f9796987480) at ../../src/yb/rocksdb/db/db_impl.cc:1873
#10 0x00007f97a43fc505 in rocksdb::DBImpl::BackgroundFlush (this=this@entry=0x1d24000, made_progress=made_progress@entry=0x7f9796987f47, job_context=job_context@entry=0x7f9796987d70, log_buffer=log_buffer@entry=0x7f9796987480,
    cfd=0x1a7b000, cfd@entry=0x0) at ../../src/yb/rocksdb/db/db_impl.cc:3202
#11 0x00007f97a4406cb3 in rocksdb::DBImpl::BackgroundCallFlush (this=this@entry=0x1d24000, cfd=cfd@entry=0x0) at ../../src/yb/rocksdb/db/db_impl.cc:3276
#12 0x00007f97a4406f6d in rocksdb::DBImpl::BGWorkFlush (db=db@entry=0x1d24000) at ../../src/yb/rocksdb/db/db_impl.cc:3132
#13 0x00007f97a4540875 in rocksdb::ThreadPool::BGThread (this=0x1adeb60, thread_id=0) at ../../src/yb/rocksdb/util/thread_posix.cc:126
#14 0x00007f97a4540899 in operator() (__closure=<optimized out>) at ../../src/yb/rocksdb/util/thread_posix.cc:165
#15 std::_Function_handler<void(), rocksdb::ThreadPool::StartBGThreads()::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...)
    at /opt/yb-build/brew/linuxbrew-20181203T161736v9-3ba4c2ed9b0587040949a4a9a95b576f520bae/Cellar/gcc/5.5.0_4/include/c++/5.5.0/functional:1871
#16 0x00007f979d804626 in operator() (this=0x1cbdc78) at /opt/yb-build/brew/linuxbrew-20181203T161736v9-3ba4c2ed9b0587040949a4a9a95b576f520bae/Cellar/gcc/5.5.0_4/include/c++/5.5.0/functional:2267
#17 yb::Thread::SuperviseThread (arg=0x1cbdc20) at ../../src/yb/util/thread.cc:771
#18 0x00007f97990b7694 in start_thread (arg=0x7f9796990700) at pthread_create.c:333
#19 0x00007f9798df941d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
```

After:
```
#0  0x00007f1a9109ba6b in raise (sig=11) at ../sysdeps/unix/sysv/linux/pt-raise.c:35
#1  0x00007f1a9edd7343 in rocksdb::(anonymous namespace)::AddHash (h=4266458700, data=0x2d58000 "", num_lines=1023, total_bits=523776, num_probes=6) at ../../src/yb/rocksdb/util/bloom.cc:66
#2  0x00007f1a9edd87c8 in rocksdb::(anonymous namespace)::FixedSizeFilterBitsBuilder::AddKey (this=0x2d30040, key=...) at ../../src/yb/rocksdb/util/bloom.cc:463
#3  0x00007f1a9ed8e757 in rocksdb::FixedSizeFilterBlockBuilder::AddKey (this=0x1ef22d0, key=...) at ../../src/yb/rocksdb/table/fixed_size_filter_block.cc:97
#4  0x00007f1a9ed8e6eb in rocksdb::FixedSizeFilterBlockBuilder::Add (this=0x1ef22d0, key=...) at ../../src/yb/rocksdb/table/fixed_size_filter_block.cc:91
#5  0x00007f1a9ed5b2ea in rocksdb::BlockBasedTableBuilder::Add (this=0x1e4dc00, key=..., value=...) at ../../src/yb/rocksdb/table/block_based_table_builder.cc:468
#6  0x00007f1a9eb80bd4 in rocksdb::BuildTable (dbname="/tmp/yb_tests__2020-12-14T18_15_29__23449.18214.17918/mytestdb-814110369", env=0x7f1a9f35fe40 <rocksdb::Env::Default()::default_env>, ioptions=..., env_options=...,
    table_cache=0x1e4b740, iter=0x7f1a8e961858, meta=0x7f1a8e962480, internal_comparator=std::shared_ptr<const rocksdb::InternalKeyComparator> (use count 3, weak count 0) = {...},
    int_tbl_prop_collector_factories=std::vector of length 1, capacity 1 = {...}, column_family_id=0, snapshots=std::vector of length 0, capacity 0, earliest_write_conflict_snapshot=72057594037927935,
    compression=rocksdb::kSnappyCompression, compression_opts=..., paranoid_file_checks=false, internal_stats=0x1ebc200, boundary_values_extractor=0x1e4c3f0, io_priority=rocksdb::Env::IO_HIGH, table_properties=0x7f1a8e962f70)
    at ../../src/yb/rocksdb/db/builder.cc:160
#7  0x00007f1a9ec8e56b in rocksdb::FlushJob::WriteLevel0Table (this=0x7f1a8e962eb0, mems=..., edit=0x1ec0268, meta=0x7f1a8e962480) at ../../src/yb/rocksdb/db/flush_job.cc:290
#8  0x00007f1a9ec8d767 in rocksdb::FlushJob::Run (this=0x7f1a8e962eb0, file_meta=0x7f1a8e962c70) at ../../src/yb/rocksdb/db/flush_job.cc:191
#9  0x00007f1a9ec10c56 in rocksdb::DBImpl::FlushMemTableToOutputFile (this=0x1ea0000, cfd=0x1bf7000, mutable_cf_options=..., made_progress=0x7f1a8e9640b7, job_context=0x7f1a8e963ee0, log_buffer=0x7f1a8e9635f0)
    at ../../src/yb/rocksdb/db/db_impl.cc:1873
#10 0x00007f1a9ec18bb6 in rocksdb::DBImpl::BackgroundFlush (this=0x1ea0000, made_progress=0x7f1a8e9640b7, job_context=0x7f1a8e963ee0, log_buffer=0x7f1a8e9635f0, cfd=0x1bf7000) at ../../src/yb/rocksdb/db/db_impl.cc:3202
#11 0x00007f1a9ec1914d in rocksdb::DBImpl::BackgroundCallFlush (this=0x1ea0000, cfd=0x0) at ../../src/yb/rocksdb/db/db_impl.cc:3276
#12 0x00007f1a9ec182fa in rocksdb::DBImpl::BGWorkFlush (db=0x1ea0000) at ../../src/yb/rocksdb/db/db_impl.cc:3132
#13 0x00007f1a9ee02747 in rocksdb::ThreadPool::BGThread (this=0x1c5ab60, thread_id=0) at ../../src/yb/rocksdb/util/thread_posix.cc:126
#14 0x00007f1a9ee028c6 in rocksdb::ThreadPool::<lambda()>::operator()(void) const (__closure=0x1e39c78) at ../../src/yb/rocksdb/util/thread_posix.cc:165
#15 0x00007f1a9ee03140 in std::_Function_handler<void(), rocksdb::ThreadPool::StartBGThreads()::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...)
    at /opt/yb-build/brew/linuxbrew-20181203T161736v9-3ba4c2ed9b0587040949a4a9a95b576f520bae/Cellar/gcc/5.5.0_4/include/c++/5.5.0/functional:1871
#16 0x00007f1aa1efe732 in std::function<void ()>::operator()() const (this=0x1e39c78) at /opt/yb-build/brew/linuxbrew-20181203T161736v9-3ba4c2ed9b0587040949a4a9a95b576f520bae/Cellar/gcc/5.5.0_4/include/c++/5.5.0/functional:2267
#17 0x00007f1a95e6bcf9 in yb::Thread::SuperviseThread (arg=0x1e39c20) at ../../src/yb/util/thread.cc:771
#18 0x00007f1a91093694 in start_thread (arg=0x7f1a8e96c700) at pthread_create.c:333
#19 0x00007f1a90dd541d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
```

Test Plan:
```
#!/usr/bin/env bash

set -euo pipefail

for i in {1..50}
do
  echo "Iteration: $i"
  # Remove the previously generated initial sys catalog snapshot so the
  # build regenerates it.
  rm -rf build/debug-gcc-dynamic-ninja/share/initial_sys_catalog_snapshot
  ybd --sj packaged
  # Recreate the cluster from scratch and verify a basic query succeeds.
  ./bin/yb-ctl destroy
  ./bin/yb-ctl start
  ./bin/ysqlsh -c "SELECT 1"
done
./bin/yb-ctl destroy
```

Reviewers: bogdan, sergei, dmitry, mbautin

Reviewed By: mbautin

Subscribers: eng

Differential Revision: https://phabricator.dev.yugabyte.com/D10121
jasonyb pushed a commit that referenced this issue Dec 15, 2023
Summary:
YB Seq Scan code path is not hit because Foreign Scan is the default and
pg_hint_plan does not work.  The upcoming merge with YB master will bring
in master commit 465ee2c, which changes the default to YB Seq Scan.

To test YB Seq Scan, a temporary patch is needed (see the test plan).
With that, two bugs are encountered: fix them.

1. FailedAssertion("TTS_IS_VIRTUAL(slot)"

   On simple test case

       create table t (i int primary key, j int);
       select * from t;

   get

       TRAP: FailedAssertion("TTS_IS_VIRTUAL(slot)", File: "../../../../../../../src/postgres/src/backend/access/yb_access/yb_scan.c", Line: 3473, PID: 2774450)

   Details:

       #0  0x00007fd52616eacf in raise () from /lib64/libc.so.6
       #1  0x00007fd526141ea5 in abort () from /lib64/libc.so.6
       #2  0x0000000000af33ad in ExceptionalCondition (conditionName=conditionName@entry=0xc2938d "TTS_IS_VIRTUAL(slot)", errorType=errorType@entry=0xc01498 "FailedAssertion",
           fileName=fileName@entry=0xc28f18 "../../../../../../../src/postgres/src/backend/access/yb_access/yb_scan.c", lineNumber=lineNumber@entry=3473)
           at ../../../../../../../src/postgres/src/backend/utils/error/assert.c:69
       #3  0x00000000005c26bd in ybFetchNext (handle=0x2600ffc43680, slot=slot@entry=0x2600ff6c2980, relid=16384)
           at ../../../../../../../src/postgres/src/backend/access/yb_access/yb_scan.c:3473
       #4  0x00000000007de444 in YbSeqNext (node=0x2600ff6c2778) at ../../../../../../src/postgres/src/backend/executor/nodeYbSeqscan.c:156
       #5  0x000000000078b3c6 in ExecScanFetch (node=node@entry=0x2600ff6c2778, accessMtd=accessMtd@entry=0x7de2b9 <YbSeqNext>, recheckMtd=recheckMtd@entry=0x7de26e <YbSeqRecheck>)
           at ../../../../../../src/postgres/src/backend/executor/execScan.c:133
       #6  0x000000000078b44e in ExecScan (node=0x2600ff6c2778, accessMtd=accessMtd@entry=0x7de2b9 <YbSeqNext>, recheckMtd=recheckMtd@entry=0x7de26e <YbSeqRecheck>)
           at ../../../../../../src/postgres/src/backend/executor/execScan.c:182
       #7  0x00000000007de298 in ExecYbSeqScan (pstate=<optimized out>) at ../../../../../../src/postgres/src/backend/executor/nodeYbSeqscan.c:191
       #8  0x00000000007871ef in ExecProcNodeFirst (node=0x2600ff6c2778) at ../../../../../../src/postgres/src/backend/executor/execProcnode.c:480
       #9  0x000000000077db0e in ExecProcNode (node=0x2600ff6c2778) at ../../../../../../src/postgres/src/include/executor/executor.h:285
       #10 ExecutePlan (execute_once=<optimized out>, dest=0x2600ff6b1a10, direction=<optimized out>, numberTuples=0, sendTuples=true, operation=CMD_SELECT,
           use_parallel_mode=<optimized out>, planstate=0x2600ff6c2778, estate=0x2600ff6c2128) at ../../../../../../src/postgres/src/backend/executor/execMain.c:1650
       #11 standard_ExecutorRun (queryDesc=0x2600ff675128, direction=<optimized out>, count=0, execute_once=<optimized out>)
           at ../../../../../../src/postgres/src/backend/executor/execMain.c:367
       #12 0x000000000077dbfe in ExecutorRun (queryDesc=queryDesc@entry=0x2600ff675128, direction=direction@entry=ForwardScanDirection, count=count@entry=0, execute_once=<optimized out>)
           at ../../../../../../src/postgres/src/backend/executor/execMain.c:308
       #13 0x0000000000982617 in PortalRunSelect (portal=portal@entry=0x2600ff90e128, forward=forward@entry=true, count=0, count@entry=9223372036854775807, dest=dest@entry=0x2600ff6b1a10)
           at ../../../../../../src/postgres/src/backend/tcop/pquery.c:954
       #14 0x000000000098433c in PortalRun (portal=portal@entry=0x2600ff90e128, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true,
           dest=dest@entry=0x2600ff6b1a10, altdest=altdest@entry=0x2600ff6b1a10, qc=0x7fffc14a13c0) at ../../../../../../src/postgres/src/backend/tcop/pquery.c:786
       #15 0x000000000097e65b in exec_simple_query (query_string=0x2600ffdc6128 "select * from t;") at ../../../../../../src/postgres/src/backend/tcop/postgres.c:1321
       #16 yb_exec_simple_query_impl (query_string=query_string@entry=0x2600ffdc6128) at ../../../../../../src/postgres/src/backend/tcop/postgres.c:5060
       #17 0x000000000097b7a5 in yb_exec_query_wrapper_one_attempt (exec_context=exec_context@entry=0x2600ffdc6000, restart_data=restart_data@entry=0x7fffc14a1640,
           functor=functor@entry=0x97e033 <yb_exec_simple_query_impl>, functor_context=functor_context@entry=0x2600ffdc6128, attempt=attempt@entry=0, retry=retry@entry=0x7fffc14a15ff)
           at ../../../../../../src/postgres/src/backend/tcop/postgres.c:5028
       #18 0x000000000097d077 in yb_exec_query_wrapper (exec_context=exec_context@entry=0x2600ffdc6000, restart_data=restart_data@entry=0x7fffc14a1640,
           functor=functor@entry=0x97e033 <yb_exec_simple_query_impl>, functor_context=functor_context@entry=0x2600ffdc6128)
           at ../../../../../../src/postgres/src/backend/tcop/postgres.c:5052
       #19 0x000000000097d0ca in yb_exec_simple_query (query_string=query_string@entry=0x2600ffdc6128 "select * from t;", exec_context=exec_context@entry=0x2600ffdc6000)
           at ../../../../../../src/postgres/src/backend/tcop/postgres.c:5075
       #20 0x000000000097fe8a in PostgresMain (dbname=<optimized out>, username=<optimized out>) at ../../../../../../src/postgres/src/backend/tcop/postgres.c:5794
       #21 0x00000000008c8354 in BackendRun (port=0x2600ff8423c0) at ../../../../../../src/postgres/src/backend/postmaster/postmaster.c:4791
       #22 BackendStartup (port=0x2600ff8423c0) at ../../../../../../src/postgres/src/backend/postmaster/postmaster.c:4491
       #23 ServerLoop () at ../../../../../../src/postgres/src/backend/postmaster/postmaster.c:1878
       #24 0x00000000008caa55 in PostmasterMain (argc=argc@entry=25, argv=argv@entry=0x2600ffdc01a0) at ../../../../../../src/postgres/src/backend/postmaster/postmaster.c:1533
       #25 0x0000000000804ba8 in PostgresServerProcessMain (argc=25, argv=0x2600ffdc01a0) at ../../../../../../src/postgres/src/backend/main/main.c:208
       #26 0x0000000000804bc8 in main ()

       3469    ybFetchNext(YBCPgStatement handle,
       3470                            TupleTableSlot *slot, Oid relid)
       3471    {
       3472            Assert(slot != NULL);
       3473            Assert(TTS_IS_VIRTUAL(slot));

       (gdb) p *slot
       $2 = {type = T_TupleTableSlot, tts_flags = 18, tts_nvalid = 0, tts_ops = 0xeaf5e0 <TTSOpsHeapTuple>, tts_tupleDescriptor = 0x2600ff6416c0, tts_values = 0x2600ff6c2a00, tts_isnull = 0x2600ff6c2a10, tts_mcxt = 0x2600ff6c2000, tts_tid = {ip_blkid = {bi_hi = 0, bi_lo = 0}, ip_posid = 0, yb_item = {ybctid = 0}}, tts_tableOid = 0, tts_yb_insert_oid = 0}

   Fix by making YB Seq Scan always use a virtual slot.  This is similar
   to what is done for YB Foreign Scan.

2. segfault in ending scan

   The same simple test case gives a segfault at a later stage.

   Details:

       #0  0x00000000007de762 in table_endscan (scan=0x3debfe3ab88) at ../../../../../../src/postgres/src/include/access/tableam.h:997
       #1  ExecEndYbSeqScan (node=node@entry=0x3debfe3a778) at ../../../../../../src/postgres/src/backend/executor/nodeYbSeqscan.c:298
       #2  0x0000000000787a75 in ExecEndNode (node=0x3debfe3a778) at ../../../../../../src/postgres/src/backend/executor/execProcnode.c:649
       #3  0x000000000077ffaf in ExecEndPlan (estate=0x3debfe3a128, planstate=<optimized out>) at ../../../../../../src/postgres/src/backend/executor/execMain.c:1489
       #4  standard_ExecutorEnd (queryDesc=0x2582fdc88928) at ../../../../../../src/postgres/src/backend/executor/execMain.c:503
       #5  0x00000000007800f8 in ExecutorEnd (queryDesc=queryDesc@entry=0x2582fdc88928) at ../../../../../../src/postgres/src/backend/executor/execMain.c:474
       #6  0x00000000006f140c in PortalCleanup (portal=0x2582ff900128) at ../../../../../../src/postgres/src/backend/commands/portalcmds.c:305
       #7  0x0000000000b3c36a in PortalDrop (portal=portal@entry=0x2582ff900128, isTopCommit=isTopCommit@entry=false)
           at ../../../../../../../src/postgres/src/backend/utils/mmgr/portalmem.c:514
       #8  0x000000000097e667 in exec_simple_query (query_string=0x2582ffdc6128 "select * from t;") at ../../../../../../src/postgres/src/backend/tcop/postgres.c:1331
       #9  yb_exec_simple_query_impl (query_string=query_string@entry=0x2582ffdc6128) at ../../../../../../src/postgres/src/backend/tcop/postgres.c:5060
       #10 0x000000000097b79a in yb_exec_query_wrapper_one_attempt (exec_context=exec_context@entry=0x2582ffdc6000, restart_data=restart_data@entry=0x7ffc81c0e7d0,
           functor=functor@entry=0x97e028 <yb_exec_simple_query_impl>, functor_context=functor_context@entry=0x2582ffdc6128, attempt=attempt@entry=0, retry=retry@entry=0x7ffc81c0e78f)
           at ../../../../../../src/postgres/src/backend/tcop/postgres.c:5028
       #11 0x000000000097d06c in yb_exec_query_wrapper (exec_context=exec_context@entry=0x2582ffdc6000, restart_data=restart_data@entry=0x7ffc81c0e7d0,
           functor=functor@entry=0x97e028 <yb_exec_simple_query_impl>, functor_context=functor_context@entry=0x2582ffdc6128)
           at ../../../../../../src/postgres/src/backend/tcop/postgres.c:5052
       #12 0x000000000097d0bf in yb_exec_simple_query (query_string=query_string@entry=0x2582ffdc6128 "select * from t;", exec_context=exec_context@entry=0x2582ffdc6000)
           at ../../../../../../src/postgres/src/backend/tcop/postgres.c:5075
       #13 0x000000000097fe7f in PostgresMain (dbname=<optimized out>, username=<optimized out>) at ../../../../../../src/postgres/src/backend/tcop/postgres.c:5794
       #14 0x00000000008c8349 in BackendRun (port=0x2582ff8403c0) at ../../../../../../src/postgres/src/backend/postmaster/postmaster.c:4791
       #15 BackendStartup (port=0x2582ff8403c0) at ../../../../../../src/postgres/src/backend/postmaster/postmaster.c:4491
       #16 ServerLoop () at ../../../../../../src/postgres/src/backend/postmaster/postmaster.c:1878
       #17 0x00000000008caa4a in PostmasterMain (argc=argc@entry=25, argv=argv@entry=0x2582ffdc01a0) at ../../../../../../src/postgres/src/backend/postmaster/postmaster.c:1533
       #18 0x0000000000804b9d in PostgresServerProcessMain (argc=25, argv=0x2582ffdc01a0) at ../../../../../../src/postgres/src/backend/main/main.c:208
       #19 0x0000000000804bbd in main ()

       294             /*
       295              * close heap scan
       296              */
       297             if (tsdesc != NULL)
       298                     table_endscan(tsdesc);

   The reason is that the initial merge 55782d5
   incorrectly merged the end of ExecEndYbSeqScan.  Upstream PG commit
   9ddef36278a9f676c07d0b4d9f33fa22e48ce3b5 removes this code, but the
   initial merge duplicated the lines.  Remove those lines.

Test Plan:
Apply the following patch to activate YB Seq Scan:

    diff --git a/src/postgres/src/backend/optimizer/path/allpaths.c b/src/postgres/src/backend/optimizer/path/allpaths.c
    index 8a4c38a965..854d84a648 100644
    --- a/src/postgres/src/backend/optimizer/path/allpaths.c
    +++ b/src/postgres/src/backend/optimizer/path/allpaths.c
    @@ -576,7 +576,7 @@ set_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
                     else
                     {
                         /* Plain relation */
    -                    if (IsYBRelationById(rte->relid))
    +                    if (false)
                         {
                             /*
                              * Using a foreign scan which will use the YB FDW by

On almalinux 8,

    ./yb_build.sh fastdebug --gcc11
    pg15_tests/run_all_tests.sh fastdebug --gcc11 --sj --sp --scb

fails the following tests:

- test_D29546
- test_pg15_regress: yb_pg15
- test_types_geo: yb_pg_box
- test_hash_in_queries: yb_hash_in_queries

Manually check to see that they are due to YB Seq Scan explain output
differences.

Reviewers: aagrawal, tfoucher

Reviewed By: tfoucher

Subscribers: yql

Differential Revision: https://phorge.dev.yugabyte.com/D31139