Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug]: User job procedure using plpython crashes the database #5326

Closed
justinozavala opened this issue Feb 14, 2023 · 4 comments · Fixed by #5343
Closed

[Bug]: User job procedure using plpython crashes the database #5326

justinozavala opened this issue Feb 14, 2023 · 4 comments · Fixed by #5343
Assignees
Labels

Comments

@justinozavala
Copy link

What type of bug is this?

Crash

What subsystems and features are affected?

Background worker

What happened?

The database crashes when i try to run a scheduled procedure written in plpython.

TimescaleDB version affected

2.9.3

PostgreSQL version used

13.8

What operating system did you use?

Windows 10

What installation method did you use?

Other

What platform did you run on?

On prem/Self-hosted

Relevant log output and stack trace

No response

How can we reproduce the bug?

Create a procedure in plpython:

`CREATE OR REPLACE PROCEDURE public._calculate_delta(job_id integer, config jsonb)
 LANGUAGE plpython3u
AS $procedure$
  	plpy.info("Hello world")
  $procedure$
;`

Schedule it:
`SELECT add_job('_calculate_delta', '10m', config => '{}');`

The database crashes and next_start in timescaledb_information.jobs is set to infinity.
@shhnwz shhnwz self-assigned this Feb 15, 2023
@shhnwz
Copy link
Contributor

shhnwz commented Feb 15, 2023

Hi @justinozavala

Thanks for reporting this issue with a reproducible test case. It would be great if you could share the log for the crash too.

@justinozavala
Copy link
Author

Hello, here follows the log:

`2023-02-14 19:07:37.153 -03 [6872] LOG: background worker "User-Defined Action [1006]" (PID 7964) was terminated by exception 0xC0000005
2023-02-14 19:07:37.153 -03 [6872] DETAIL: Failed process was running: CALL public._calculate_delta()
2023-02-14 19:07:37.153 -03 [6872] HINT: See C include file "ntstatus.h" for a description of the hexadecimal value.
2023-02-14 19:07:37.153 -03 [6872] LOG: terminating any other active server processes
2023-02-14 19:07:37.159 -03 [1312] WARNING: terminating connection because of crash of another server process
2023-02-14 19:07:37.159 -03 [1312] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2023-02-14 19:07:37.159 -03 [1312] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2023-02-14 19:07:37.161 -03 [1624] WARNING: terminating connection because of crash of another server process
2023-02-14 19:07:37.161 -03 [1624] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2023-02-14 19:07:37.161 -03 [1624] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2023-02-14 19:07:37.162 -03 [7680] WARNING: terminating connection because of crash of another server process
2023-02-14 19:07:37.162 -03 [7680] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2023-02-14 19:07:37.162 -03 [7680] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2023-02-14 19:07:37.164 -03 [884] WARNING: terminating connection because of crash of another server process
2023-02-14 19:07:37.164 -03 [884] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2023-02-14 19:07:37.164 -03 [884] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2023-02-14 19:07:37.165 -03 [1116] WARNING: terminating connection because of crash of another server process
2023-02-14 19:07:37.165 -03 [1116] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2023-02-14 19:07:37.165 -03 [1116] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2023-02-14 19:07:37.167 -03 [4920] WARNING: terminating connection because of crash of another server process
2023-02-14 19:07:37.167 -03 [4920] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2023-02-14 19:07:37.167 -03 [4920] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2023-02-14 19:07:37.173 -03 [8168] WARNING: terminating connection because of crash of another server process
2023-02-14 19:07:37.173 -03 [8168] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2023-02-14 19:07:37.173 -03 [8168] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2023-02-14 19:07:37.258 -03 [6872] LOG: all server processes terminated; reinitializing
2023-02-14 19:07:37.326 -03 [8104] LOG: database system was interrupted; last known up at 2023-02-14 19:06:37 -03
2023-02-14 19:07:38.065 -03 [1344] FATAL: the database system is in recovery mode

Lot more of this statements....

2023-02-14 19:09:28.894 -03 [2592] FATAL: the database system is in recovery mode
2023-02-14 19:09:33.031 -03 [3496] FATAL: the database system is in recovery mode
2023-02-14 19:09:33.270 -03 [8104] LOG: database system was not properly shut down; automatic recovery in progress
2023-02-14 19:09:33.316 -03 [8104] LOG: redo starts at 8F/E0A3CF10
2023-02-14 19:09:33.330 -03 [8104] LOG: invalid record length at 8F/E0AC84B8: wanted 24, got 0
2023-02-14 19:09:33.330 -03 [8104] LOG: redo done at 8F/E0AC8490
2023-02-14 19:09:34.982 -03 [6872] LOG: database system is ready to accept connections`

ntstatus code says:

0xC0000005
STATUS_ACCESS_VIOLATION

@mkindahl mkindahl assigned mkindahl and unassigned shhnwz Feb 16, 2023
@mkindahl
Copy link
Contributor

Reproducible with the following backtrace:

(gdb) bt
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=140351424279040) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=140351424279040) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=140351424279040, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x00007fa61ccde476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x00007fa61ccc47f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x0000557b8c191598 in ExceptionalCondition (conditionName=conditionName@entry=0x557b8c2e7110 "MemoryContextIsValid(context)", errorType=errorType@entry=0x557b8c1f000f "BadArgument", fileName=fileName@entry=0x557b8c35df98 "mcxt.c", 
    lineNumber=lineNumber@entry=867) at assert.c:69
#6  0x0000557b8c1c1af6 in MemoryContextAlloc (context=<optimized out>, size=size@entry=24) at mcxt.c:867
#7  0x00007fa613355198 in PLy_push_execution_context (atomic_context=atomic_context@entry=false) at plpy_main.c:437
#8  0x00007fa6133553f3 in plpython3_call_handler (fcinfo=0x7ffcd2824f20) at plpy_main.c:238
#9  0x0000557b8be121e3 in ExecuteCallStmt (stmt=stmt@entry=0x557b8c729260, params=<optimized out>, atomic=atomic@entry=false, dest=dest@entry=0x557b8c45a960 <donothingDR>) at functioncmds.c:2311
#10 0x00007fa6133f26e8 in job_execute_procedure (funcexpr=funcexpr@entry=0x557b8c728db8) at /home/mats/work/timescale/timescaledb+support/tsl/src/bgw_policy/job.c:569
#11 0x00007fa6133f3245 in job_execute (job=0x557b8c7286b8) at /home/mats/work/timescale/timescaledb+support/tsl/src/bgw_policy/job.c:657
#12 0x00007fa61355a5d6 in ts_bgw_job_execute (job=job@entry=0x557b8c7286b8) at /home/mats/work/timescale/timescaledb+support/src/bgw/job.c:1039
#13 0x00007fa61355a866 in ts_bgw_job_entrypoint (fcinfo=<optimized out>) at /home/mats/work/timescale/timescaledb+support/src/bgw/job.c:1179
#14 0x0000557b8bf98b42 in StartBackgroundWorker () at bgworker.c:875
#15 0x0000557b8bfa6f3f in do_start_bgworker (rw=rw@entry=0x557b8c721b10) at postmaster.c:5866
#16 0x0000557b8bfa70b9 in maybe_start_bgworkers () at postmaster.c:6091
#17 0x0000557b8bfa7904 in sigusr1_handler (postgres_signal_arg=<optimized out>) at postmaster.c:5255
#18 <signal handler called>
#19 0x00007fa61cdb774d in __GI___select (nfds=nfds@entry=7, readfds=readfds@entry=0x7ffcd2826840, writefds=writefds@entry=0x0, exceptfds=exceptfds@entry=0x0, timeout=timeout@entry=0x7ffcd28267b0)
    at ../sysdeps/unix/sysv/linux/select.c:69
#20 0x0000557b8bfa829f in ServerLoop () at postmaster.c:1709
#21 0x0000557b8bfa95c0 in PostmasterMain (argc=argc@entry=1, argv=argv@entry=0x557b8c663930) at postmaster.c:1417
#22 0x0000557b8bee56e9 in main (argc=1, argv=0x557b8c663930) at main.c:209

@mkindahl
Copy link
Contributor

Fails because the call is in a non-atomic context in the background process and PortalContext is not set.

static PLyExecutionContext *
PLy_push_execution_context(bool atomic_context)
{
	PLyExecutionContext *context;

	/* Pick a memory context similar to what SPI uses. */
	context = (PLyExecutionContext *)
>>>		MemoryContextAlloc(atomic_context ? TopTransactionContext : PortalContext,
						   sizeof(PLyExecutionContext));
	context->curr_proc = NULL;
	context->scratch_ctx = NULL;
	context->next = PLy_execution_contexts;
	PLy_execution_contexts = context;
	return context;
}
(gdb) p atomic_context 
$8 = false
(gdb) p PortalContext
$9 = (MemoryContext) 0x0

mkindahl added a commit to mkindahl/timescaledb that referenced this issue Feb 17, 2023
When executing functions, SPI assumes that `TopTransactionContext` is
used for atomic execution contexts and `PortalContext` is used for
non-atomic contexts. Since jobs need to be able to commit and start
transactions, they are executing in a non-atomic context hence
`PortalContext` will be used. This is not a problem for PL/PgSQL
executor, but other executors (such as PL/Python) it would be.

This commit fixes the issue by setting the `PortalContext` variable to
the portal context created for the portal and restores it (to NULL)
after execution.

Fixes timescale#5326
mkindahl added a commit to mkindahl/timescaledb that referenced this issue Feb 17, 2023
When executing functions, SPI assumes that `TopTransactionContext` is
used for atomic execution contexts and `PortalContext` is used for
non-atomic contexts. Since jobs need to be able to commit and start
transactions, they are executing in a non-atomic context hence
`PortalContext` will be used. This is not a problem for PL/PgSQL
executor, but other executors (such as PL/Python) it would be.

This commit fixes the issue by setting the `PortalContext` variable to
the portal context created for the portal and restores it (to NULL)
after execution.

Fixes timescale#5326
mkindahl added a commit to mkindahl/timescaledb that referenced this issue Feb 17, 2023
When executing functions, SPI assumes that `TopTransactionContext` is
used for atomic execution contexts and `PortalContext` is used for
non-atomic contexts. Since jobs need to be able to commit and start
transactions, they are executing in a non-atomic context hence
`PortalContext` will be used. This is not a problem for PL/PgSQL
executor, but other executors (such as PL/Python) it would be.

This commit fixes the issue by setting the `PortalContext` variable to
the portal context created for the portal and restores it (to NULL)
after execution.

Fixes timescale#5326
mkindahl added a commit to mkindahl/timescaledb that referenced this issue Feb 17, 2023
When executing functions, SPI assumes that `TopTransactionContext` is
used for atomic execution contexts and `PortalContext` is used for
non-atomic contexts. Since jobs need to be able to commit and start
transactions, they are executing in a non-atomic context hence
`PortalContext` will be used. This is not a problem for PL/PgSQL
executor, but other executors (such as PL/Python) it would be.

This commit fixes the issue by setting the `PortalContext` variable to
the portal context created for the portal and restores it (to NULL)
after execution.

Fixes timescale#5326
mkindahl added a commit to mkindahl/timescaledb that referenced this issue Feb 17, 2023
When executing functions, SPI assumes that `TopTransactionContext` is
used for atomic execution contexts and `PortalContext` is used for
non-atomic contexts. Since jobs need to be able to commit and start
transactions, they are executing in a non-atomic context hence
`PortalContext` will be used. This is not a problem for PL/PgSQL
executor, but other executors (such as PL/Python) it would be.

This commit fixes the issue by setting the `PortalContext` variable to
the portal context created for the portal and restores it (to NULL)
after execution.

Fixes timescale#5326
mkindahl added a commit to mkindahl/timescaledb that referenced this issue Feb 18, 2023
When executing functions, SPI assumes that `TopTransactionContext` is
used for atomic execution contexts and `PortalContext` is used for
non-atomic contexts. Since jobs need to be able to commit and start
transactions, they are executing in a non-atomic context hence
`PortalContext` will be used, but `PortalContext` is not set when
starting the job. This is not a problem for PL/PgSQL executor, but for
other executors (such as PL/Python) it would be.

This commit fixes the issue by setting the `PortalContext` variable to
the portal context created for the portal and restores it (to NULL)
after execution.

Fixes timescale#5326
mkindahl added a commit to mkindahl/timescaledb that referenced this issue Feb 18, 2023
When executing functions, SPI assumes that `TopTransactionContext` is
used for atomic execution contexts and `PortalContext` is used for
non-atomic contexts. Since jobs need to be able to commit and start
transactions, they are executing in a non-atomic context hence
`PortalContext` will be used, but `PortalContext` is not set when
starting the job. This is not a problem for PL/PgSQL executor, but for
other executors (such as PL/Python) it would be.

This commit fixes the issue by setting the `PortalContext` variable to
the portal context created for the portal and restores it (to NULL)
after execution.

Fixes timescale#5326
mkindahl added a commit to mkindahl/timescaledb that referenced this issue Feb 20, 2023
When executing functions, SPI assumes that `TopTransactionContext` is
used for atomic execution contexts and `PortalContext` is used for
non-atomic contexts. Since jobs need to be able to commit and start
transactions, they are executing in a non-atomic context hence
`PortalContext` will be used, but `PortalContext` is not set when
starting the job. This is not a problem for PL/PgSQL executor, but for
other executors (such as PL/Python) it would be.

This commit fixes the issue by setting the `PortalContext` variable to
the portal context created for the portal and restores it (to NULL)
after execution.

Fixes timescale#5326
mkindahl added a commit to mkindahl/timescaledb that referenced this issue Feb 20, 2023
When executing functions, SPI assumes that `TopTransactionContext` is
used for atomic execution contexts and `PortalContext` is used for
non-atomic contexts. Since jobs need to be able to commit and start
transactions, they are executing in a non-atomic context hence
`PortalContext` will be used, but `PortalContext` is not set when
starting the job. This is not a problem for PL/PgSQL executor, but for
other executors (such as PL/Python) it would be.

This commit fixes the issue by setting the `PortalContext` variable to
the portal context created for the portal and restores it (to NULL)
after execution.

Fixes timescale#5326
mkindahl added a commit that referenced this issue Feb 20, 2023
When executing functions, SPI assumes that `TopTransactionContext` is
used for atomic execution contexts and `PortalContext` is used for
non-atomic contexts. Since jobs need to be able to commit and start
transactions, they are executing in a non-atomic context hence
`PortalContext` will be used, but `PortalContext` is not set when
starting the job. This is not a problem for PL/PgSQL executor, but for
other executors (such as PL/Python) it would be.

This commit fixes the issue by setting the `PortalContext` variable to
the portal context created for the portal and restores it (to NULL)
after execution.

Fixes #5326
github-actions bot pushed a commit that referenced this issue Mar 1, 2023
When executing functions, SPI assumes that `TopTransactionContext` is
used for atomic execution contexts and `PortalContext` is used for
non-atomic contexts. Since jobs need to be able to commit and start
transactions, they are executing in a non-atomic context hence
`PortalContext` will be used, but `PortalContext` is not set when
starting the job. This is not a problem for PL/PgSQL executor, but for
other executors (such as PL/Python) it would be.

This commit fixes the issue by setting the `PortalContext` variable to
the portal context created for the portal and restores it (to NULL)
after execution.

Fixes #5326

(cherry picked from commit 0cbd740)
mkindahl added a commit that referenced this issue Mar 1, 2023
When executing functions, SPI assumes that `TopTransactionContext` is
used for atomic execution contexts and `PortalContext` is used for
non-atomic contexts. Since jobs need to be able to commit and start
transactions, they are executing in a non-atomic context hence
`PortalContext` will be used, but `PortalContext` is not set when
starting the job. This is not a problem for PL/PgSQL executor, but for
other executors (such as PL/Python) it would be.

This commit fixes the issue by setting the `PortalContext` variable to
the portal context created for the portal and restores it (to NULL)
after execution.

Fixes #5326

(cherry picked from commit 0cbd740)
svenklemm pushed a commit that referenced this issue Mar 3, 2023
When executing functions, SPI assumes that `TopTransactionContext` is
used for atomic execution contexts and `PortalContext` is used for
non-atomic contexts. Since jobs need to be able to commit and start
transactions, they are executing in a non-atomic context hence
`PortalContext` will be used, but `PortalContext` is not set when
starting the job. This is not a problem for PL/PgSQL executor, but for
other executors (such as PL/Python) it would be.

This commit fixes the issue by setting the `PortalContext` variable to
the portal context created for the portal and restores it (to NULL)
after execution.

Fixes #5326

(cherry picked from commit 0cbd740)
timescale-automation pushed a commit that referenced this issue Mar 3, 2023
When executing functions, SPI assumes that `TopTransactionContext` is
used for atomic execution contexts and `PortalContext` is used for
non-atomic contexts. Since jobs need to be able to commit and start
transactions, they are executing in a non-atomic context hence
`PortalContext` will be used, but `PortalContext` is not set when
starting the job. This is not a problem for PL/PgSQL executor, but for
other executors (such as PL/Python) it would be.

This commit fixes the issue by setting the `PortalContext` variable to
the portal context created for the portal and restores it (to NULL)
after execution.

Fixes #5326

(cherry picked from commit 0cbd740)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants