Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[YSQL] Backend terminates without notifying client of session termination #2479

Open
ajcaldera1 opened this issue Oct 1, 2019 · 2 comments
Assignees
Labels
area/ysql Yugabyte SQL (YSQL) kind/enhancement This is an enhancement of an existing feature priority/medium Medium priority issue
Projects

Comments

@ajcaldera1
Copy link
Contributor

ajcaldera1 commented Oct 1, 2019

Jira Link: DB-2472
When running a SELECT count(*) query against a table with a large number of rows, the PG backend process can terminate and fail to notify the client that the session is complete or abended. All that appears in the log is:

2019-10-01 14:51:18.547 UTC [7377] LOG:  could not receive data from client: Connection reset by peer

I had determined the backend PID by running the following query:

tpcc=> select pg_backend_pid();
 pg_backend_pid
----------------
           7377
(1 row)

I then explored all nodes for this process and found it:

[centos@ip-172-152-54-81 ~]$ ps -ef | grep 7377
yugabyte  7377 24769  0 14:43 ?        00:00:00 postgres: tpcc tpcc 172.152.48.233(33652) SELECT
centos    7533  7034  0 14:46 pts/0    00:00:00 grep --color=auto 7377

I had run a number of pstack 7377 commands against the backend pid to determine if there was a call where it was getting hung up:

[centos@ip-172-152-54-81 ~]$ sudo pstack 7377
Thread 1 (process 7377):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x00007fd48e97585c in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /home/yugabyte/yb-software/yugabyte-2.0.1.0-b1-centos-x86_64/linuxbrew/lib/libstdc++.so.6
#2  0x00007fd48f0b7664 in yb::pggate::PgDocOp::GetResult (this=0x2c3f0e0, result_set=0x2c61d68) at ../../src/yb/yql/pggate/pg_doc_op.cc:113
#3  0x00007fd48f0a2d12 in yb::pggate::PgDml::Fetch (this=this@entry=0x2c61c80, natts=<optimized out>, values=values@entry=0x2c97ba8, isnulls=isnulls@entry=0x2c97bb0, syscols=syscols@entry=0x7ffc234bd700, has_data=0x7ffc234bd6f7) at ../../src/yb/yql/pggate/pg_dml.cc:259
#4  0x00007fd48f08dcd3 in yb::pggate::PgApiImpl::DmlFetch (this=<optimized out>, handle=handle@entry=0x2c61c80, natts=<optimized out>, values=values@entry=0x2c97ba8, isnulls=isnulls@entry=0x2c97bb0, syscols=0x7ffc234bd700, has_data=0x7ffc234bd6f7) at ../../src/yb/yql/pggate/pggate.cc:600
#5  0x00007fd48f0892db in yb::pggate::YBCPgDmlFetch (handle=handle@entry=0x2c61c80, natts=<optimized out>, values=values@entry=0x2c97ba8, isnulls=isnulls@entry=0x2c97bb0, syscols=syscols@entry=0x7ffc234bd700, has_data=has_data@entry=0x7ffc234bd6f7) at ../../src/yb/yql/pggate/ybc_pggate.cc:426
#6  0x000000000069d13f in ybcIterateForeignScan (node=0x2c77448) at ../../../../../../src/postgres/src/backend/executor/ybc_fdw.c:477
#7  0x0000000000691b53 in ForeignNext (node=node@entry=0x2c77448) at ../../../../../../src/postgres/src/backend/executor/nodeForeignscan.c:56
#8  0x000000000066bb71 in ExecScanFetch (recheckMtd=0x691a10 <ForeignRecheck>, accessMtd=0x691aa0 <ForeignNext>, node=0x2c77448) at ../../../../../../src/postgres/src/backend/executor/execScan.c:95
#9  ExecScan (node=0x2c77448, accessMtd=0x691aa0 <ForeignNext>, recheckMtd=0x691a10 <ForeignRecheck>) at ../../../../../../src/postgres/src/backend/executor/execScan.c:145
#10 0x0000000000674c3c in ExecProcNode (node=0x2c77448) at ../../../../../../src/postgres/src/include/executor/executor.h:247
#11 fetch_input_tuple (aggstate=0x2c76fe0) at ../../../../../../src/postgres/src/backend/executor/nodeAgg.c:411
#12 agg_retrieve_direct (aggstate=0x2c76fe0) at ../../../../../../src/postgres/src/backend/executor/nodeAgg.c:1892
#13 ExecAgg (pstate=0x2c76fe0) at ../../../../../../src/postgres/src/backend/executor/nodeAgg.c:1702
#14 0x000000000066229b in ExecProcNode (node=0x2c76fe0) at ../../../../../../src/postgres/src/include/executor/executor.h:247
#15 ExecutePlan (execute_once=<optimized out>, dest=0x2c2c7e0, direction=<optimized out>, numberTuples=0, sendTuples=<optimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x2c76fe0, estate=0x2c76d80) at ../../../../../../src/postgres/src/backend/executor/execMain.c:1729
#16 standard_ExecutorRun (queryDesc=0x2c67f40, direction=<optimized out>, count=0, execute_once=<optimized out>) at ../../../../../../src/postgres/src/backend/executor/execMain.c:365
#17 0x00007fd481a593fd in ybpgm_ExecutorRun (queryDesc=0x2c67f40, direction=ForwardScanDirection, count=0, execute_once=<optimized out>) at ../../../../../src/postgres/contrib/yb_pg_metrics/yb_pg_metrics.c:402
#18 0x00000000007e4b9b in PortalRunSelect (portal=portal@entry=0x2b90ea0, forward=forward@entry=true, count=0, count@entry=9223372036854775807, dest=dest@entry=0x2c2c7e0) at ../../../../../../src/postgres/src/backend/tcop/pquery.c:942
#19 0x00000000007e62b9 in PortalRun (portal=portal@entry=0x2b90ea0, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true, dest=dest@entry=0x2c2c7e0, altdest=altdest@entry=0x2c2c7e0, completionTag=0x7ffc234bdc60 "") at ../../../../../../src/postgres/src/backend/tcop/pquery.c:783
#20 0x00000000007e1f1c in exec_simple_query (query_string=0x2b118b0 "SELECT count(*) from order_line;") at ../../../../../../src/postgres/src/backend/tcop/postgres.c:1150
#21 0x00000000007e391f in PostgresMain (argc=<optimized out>, argv=argv@entry=0x2b44998, dbname=0x2b44850 "tpcc", username=0x2b44838 "tpcc") at ../../../../../../src/postgres/src/backend/tcop/postgres.c:4440
#22 0x0000000000493165 in BackendRun (port=0x2b30630) at ../../../../../../src/postgres/src/backend/postmaster/postmaster.c:4398
#23 BackendStartup (port=0x2b30630) at ../../../../../../src/postgres/src/backend/postmaster/postmaster.c:4064
#24 ServerLoop () at ../../../../../../src/postgres/src/backend/postmaster/postmaster.c:1724
#25 0x000000000075d838 in PostmasterMain (argc=argc@entry=23, argv=argv@entry=0x2b39d10) at ../../../../../../src/postgres/src/backend/postmaster/postmaster.c:1387
#26 0x00000000006b577a in PostgresServerProcessMain (argc=23, argv=0x2b39d10) at ../../../../../../src/postgres/src/backend/main/main.c:234
#27 0x00000000006b5979 in main ()

The connection had been launched from a client machine that resolved the host through a NLB. Not sure if that is significant.

Here is the stack trace of ysqlsh after I aborted the with CTRL-BREAK:

[ec2-user@ip-172-152-48-233 ~]$ gdb /home/ec2-user/yugabyte-2.0.0.0/postgres/bin/ysqlsh core.8378
GNU gdb (GDB) Red Hat Enterprise Linux 8.0.1-30.amzn2.0.3
Copyright (C) 2017 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /home/ec2-user/yugabyte-2.0.0.0/postgres/bin/ysqlsh...done.
[New LWP 8378]

warning: Unable to find libthread_db matching inferior's thread library, thread debugging will not be available.

warning: Unable to find libthread_db matching inferior's thread library, thread debugging will not be available.
Core was generated by `yugabyte-2.0.0.0/bin/ysqlsh -h field-ysql-test-lb-ab6b62e1b2bfbf6a.elb.us-east-'.
Program terminated with signal SIGQUIT, Quit.
#0  0x00007ff4507f85b0 in __poll_nocancel () at ../sysdeps/unix/syscall-template.S:84
84	../sysdeps/unix/syscall-template.S: No such file or directory.
(gdb) bt
#0  0x00007ff4507f85b0 in __poll_nocancel () at ../sysdeps/unix/syscall-template.S:84
#1  0x00007ff4522a8a9d in pqSocketPoll (end_time=-1, forWrite=0, forRead=7314880, sock=<optimized out>)
    at ../../../../../../src/postgres/src/interfaces/libpq/fe-misc.c:1127
#2  pqSocketCheck (conn=conn@entry=0x6f9dc0, forRead=forRead@entry=1, forWrite=forWrite@entry=0, end_time=end_time@entry=-1)
    at ../../../../../../src/postgres/src/interfaces/libpq/fe-misc.c:1069
#3  0x00007ff4522a95b6 in pqSocketCheck (end_time=end_time@entry=-1, forWrite=forWrite@entry=0, forRead=forRead@entry=1, conn=conn@entry=0x6f9dc0)
    at ../../../../../../src/postgres/src/interfaces/libpq/fe-misc.c:1051
#4  pqWaitTimed (forRead=forRead@entry=1, forWrite=forWrite@entry=0, conn=conn@entry=0x6f9dc0, finish_time=finish_time@entry=-1)
    at ../../../../../../src/postgres/src/interfaces/libpq/fe-misc.c:1001
#5  0x00007ff4522a9621 in pqWait (forRead=forRead@entry=1, forWrite=forWrite@entry=0, conn=conn@entry=0x6f9dc0)
    at ../../../../../../src/postgres/src/interfaces/libpq/fe-misc.c:986
#6  0x00007ff4522a6439 in PQgetResult (conn=0x6f9dc0) at ../../../../../../src/postgres/src/interfaces/libpq/fe-exec.c:1783
#7  0x00007ff4522a6985 in PQexecFinish (conn=<optimized out>) at ../../../../../../src/postgres/src/interfaces/libpq/fe-exec.c:2069
#8  PQexec (conn=0x6f9dc0, query=query@entry=0x726230 "SELECT count(*) from order_line;")
    at ../../../../../../src/postgres/src/interfaces/libpq/fe-exec.c:1903
#9  0x000000000040d4e1 in SendQuery (query=0x726230 "SELECT count(*) from order_line;") at ../../../../../../src/postgres/src/bin/psql/common.c:1438
#10 0x0000000000422a0c in MainLoop (source=0x7ff450ab24c0 <_IO_2_1_stdin_>) at ../../../../../../src/postgres/src/bin/psql/mainloop.c:429
#11 0x000000000040695b in main (argc=<optimized out>, argv=0x7ffeb1907818) at ../../../../../../src/postgres/src/bin/psql/startup.c:414
@ajcaldera1 ajcaldera1 added kind/bug This issue is a bug area/ysql Yugabyte SQL (YSQL) labels Oct 1, 2019
@ajcaldera1 ajcaldera1 added this to To do in YSQL via automation Oct 1, 2019
@ajcaldera1
Copy link
Contributor Author

Interestingly enough, if I connect directly to a node instead of going via the NLB, I got a successful return:

[ec2-user@ip-172-152-48-233 ~]$ yugabyte-2.0.0.0/bin/ysqlsh -h 172.152.54.81 -U tpcc -d tpcc
ysqlsh (11.2-YB-2.0.0.0-b0)
Type "help" for help.

tpcc=> select pg_backend_pid();
 pg_backend_pid
----------------
           9914
(1 row)

tpcc=> \timing
Timing is on.
tpcc=> select count(*) from order_line;
  count
----------
 30009855
(1 row)

Time: 365029.226 ms (06:05.029)

@ajcaldera1
Copy link
Contributor Author

ajcaldera1 commented Oct 1, 2019

Well I think I found a new clue:

Connection Idle Timeout
For each TCP request that a client makes through a Network Load Balancer, the state of that connection is tracked. If no data is sent through the connection by either the client or target for longer than the idle timeout, the connection is closed. If a client or a target sends data after the idle timeout period elapses, it receives a TCP RST packet to indicate that the connection is no longer valid.

Elastic Load Balancing sets the idle timeout value for TCP flows to 350 seconds. You cannot modify this value. For TCP listeners, clients or targets can use TCP keepalive packets to reset the idle timeout. TCP keepalive packets are not supported for TLS listeners.

So apparently the option does exist on the client to turn on keepalives (OFF by default in Linux environments). We should still respond to the TCP RST packet in the client and terminate the client gracefully instead of leaving it hanging indefinitely.

I tried setting "keepalives=1" on the invocation of ysqlsh and that did not seem to help.

@ndeodhar ndeodhar moved this from To do to Backlog in YSQL Aug 5, 2020
@yugabyte-ci yugabyte-ci added the priority/medium Medium priority issue label Jun 9, 2022
@yugabyte-ci yugabyte-ci added kind/enhancement This is an enhancement of an existing feature and removed kind/bug This issue is a bug labels Sep 5, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/ysql Yugabyte SQL (YSQL) kind/enhancement This is an enhancement of an existing feature priority/medium Medium priority issue
Projects
Status: No status
YSQL
  
Backlog
Development

No branches or pull requests

4 participants