
[Bug]: Segmentation fault in decompress chunk after appending new rows #5411

Closed
kgyrtkirk opened this issue Mar 7, 2023 · 9 comments

@kgyrtkirk (Contributor)

What type of bug is this?

Crash

What subsystems and features are affected?

Compression

What happened?

The testcase is more elaborate, but most likely the only relevant part is how compression was enabled (segmentby/orderby).
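
For illustration only, a minimal sketch of the kind of setup this refers to - the table and column names are hypothetical, and the actual testcase lives in the reproduction repo linked below:

-- Hypothetical hypertable; the real testcase is in the linked repro repository.
CREATE TABLE metrics (time timestamptz NOT NULL, device_id int, value float8);
SELECT create_hypertable('metrics', 'time');

-- Compression enabled with both segmentby and orderby, as mentioned above.
ALTER TABLE metrics SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'device_id',
    timescaledb.compress_orderby   = 'time DESC'
);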

The segmentation fault happened beneath tsl/src/nodes/decompress_chunk/exec.c:394:

Core was generated by `postgres:'.
Program terminated with signal SIGSEGV, Segmentation fault.

warning: Section `.reg-xstate/78984' in core file too small.
#0  toast_build_flattened_tuple (tupleDesc=0x557aaa2c5148, values=<optimized out>, isnull=0x557aaa132de8) at heaptoast.c:592
592				if (VARATT_IS_EXTERNAL(new_value))
(gdb) bt
#0  toast_build_flattened_tuple (tupleDesc=0x557aaa2c5148, values=<optimized out>, isnull=0x557aaa132de8) at heaptoast.c:592
#1  0x0000557aa8f70fe3 in ExecEvalWholeRowVar (state=state@entry=0x557aaa133778, op=op@entry=0x557aaa133fa0, econtext=econtext@entry=0x557aa9fd7458) at execExprInterp.c:4179
#2  0x0000557aa8f7150a in ExecInterpExpr (state=0x557aaa133778, econtext=0x557aa9fd7458, isnull=0x7ffde79bafdf) at execExprInterp.c:616
#3  0x0000557aa8f6e6f8 in ExecInterpExprStillValid (state=0x557aaa133778, econtext=0x557aa9fd7458, isNull=0x7ffde79bafdf) at execExprInterp.c:1826
#4  0x00007fb8c3f21bad in ExecEvalExprSwitchContext (isNull=0x7ffde79bafdf, econtext=0x557aa9fd7458, state=0x557aaa133778) at /home/dev/pg/REL_15_2/include/postgresql/server/executor/executor.h:341
#5  ExecProject (projInfo=0x557aaa133770) at /home/dev/pg/REL_15_2/include/postgresql/server/executor/executor.h:375
#6  decompress_chunk_exec (node=<optimized out>) at /home/dev/timescaledb/tsl/src/nodes/decompress_chunk/exec.c:394
#7  0x0000557aa8f938b9 in ExecCustomScan (pstate=<optimized out>) at nodeCustom.c:115
#8  0x0000557aa8f7e168 in ExecProcNodeFirst (node=0x557aaa132b10) at execProcnode.c:464
#9  0x0000557aa8fafc8f in ExecProcNode (node=node@entry=0x557aaa132b10) at ../../../src/include/executor/executor.h:259
[...]

TimescaleDB version affected

2.10.1 and current main

PostgreSQL version used

15

What operating system did you use?

timescale/timescaledb:2.10.1-pg15

What installation method did you use?

Docker

What platform did you run on?

Other

Relevant log output and stack trace

https://github.com/kgyrtkirk/reprox/actions/runs/4354697899/jobs/7610324856#step:3:1441

How can we reproduce the bug?

run the case by building the image of:
https://github.com/kgyrtkirk/reprox/tree/comp-test_segfault
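
Continuing the hypothetical sketch from above, the scenario in the issue title presumably boils down to something like the following (an assumption based on the description, not the exact test script from the repo):

-- Compress the existing chunks of the hypothetical 'metrics' table.
SELECT compress_chunk(c) FROM show_chunks('metrics') c;

-- Append new rows after compression has run.
INSERT INTO metrics VALUES (now(), 1, 0.5);

-- A query whose target list references the whole row is evaluated via
-- ExecProject/ExecEvalWholeRowVar on top of the DecompressChunk node,
-- matching the frames in the backtrace above.
SELECT m FROM metrics m LIMIT 10;
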
@mickael-choisnet commented Mar 8, 2023

Hello @kgyrtkirk,

Do you think this bug is specific to PostgreSQL 15 or can it occur on PostgreSQL 14?

I ran into a Segmentation fault error on two of my databases after upgrading timescaledb from 2.9.3 to 2.10.0 (running on PostgreSQL 14.5).

The crash occurs when querying a hypertable with a join. It only happens after the compression jobs have run for the first time since the upgrade.
I'm not sure it's the same issue. I am trying to find a way to reproduce it, but so far I haven't succeeded.
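
A hypothetical sketch of such a query shape (the table and column names here are made up):

-- Made-up hypertable 'metrics' joined to a plain 'devices' table with a LIMIT,
-- roughly matching the Limit -> NestLoop -> Append -> DecompressChunk plan
-- visible in the backtrace posted later in this thread.
SELECT d.name, m.time, m.value
FROM metrics m
JOIN devices d ON d.id = m.device_id
WHERE m.time > now() - interval '1 day'
LIMIT 100;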

@kgyrtkirk (Contributor Author) commented Mar 8, 2023

Do you happen to have a core file or a backtrace?
I'll run it with PG14 - but I've noticed that there are some differences. Some of them arise from the fact that the random() function returns different numbers across versions even when the same seed is set, which negatively impacts the cross-version reproduction rate. However, I was able to reproduce this issue by altering the start seed of the test - the backtrace is the same, so I think this could happen on PG14 as well.

Please try to get at least a backtrace so that we can validate whether it's the same issue or a different one.

https://github.com/kgyrtkirk/reprox/tree/comp-test_segfault-14

@mickael-choisnet

I don't have the necessary tools installed on the VM to create a backtrace (not yet, but I can look into installing what's needed).
Now that I have downgraded the timescaledb extension and the problem has vanished, I'm not sure I'll be able to reproduce it the same way...

I'll try to make a backup of my cluster with pgBackRest and test the update again on another VM (I cannot just crash the one I fixed; it is not production, but it is used).

@mickael-choisnet commented Mar 8, 2023

@kgyrtkirk,

I've successfully done what I said:

  • Make a backup of the cluster with pgBackRest
  • Restore it on another VM
  • Install gdb and the necessary packages to debug postgresql-14 on this VM
  • Update timescaledb from 2.9.3 to 2.10.0 for the restored database
  • Run the query that makes the database crash (and it does crash)

I've attached gdb to the backend process of my query and here is the stacktrace I get:

Continuing.

Program received signal SIGSEGV, Segmentation fault.
0x000055e76670b52e in CheckVarSlotCompatibility (attnum=1, vartype=25, slot=<optimized out>) at ./build/../src/backend/executor/execExprInterp.c:1904
1904	./build/../src/backend/executor/execExprInterp.c: No such file or directory.
#0  0x000055e76670b52e in CheckVarSlotCompatibility (attnum=1, vartype=25, slot=<optimized out>) at ./build/../src/backend/executor/execExprInterp.c:1904
#1  0x000055e76670bbc4 in CheckVarSlotCompatibility (slot=0x0, vartype=<optimized out>, attnum=<optimized out>) at ./build/../src/backend/executor/execExprInterp.c:1902
#2  CheckExprStillValid (state=state@entry=0x55e76735d7d8, econtext=econtext@entry=0x55e76735e0a8) at ./build/../src/backend/executor/execExprInterp.c:1868
#3  0x000055e76670bbeb in ExecInterpExprStillValid (state=0x55e76735d7d8, econtext=0x55e76735e0a8, isNull=0x7ffd497afe57) at ./build/../src/backend/executor/execExprInterp.c:1818
#4  0x000055e76673686c in ExecEvalExpr (isNull=0x7ffd497afe57, econtext=0x55e76735e0a8, state=<optimized out>) at ./build/../src/include/executor/executor.h:320
#5  ExecIndexEvalRuntimeKeys (econtext=econtext@entry=0x55e76735e0a8, runtimeKeys=<optimized out>, numRuntimeKeys=<optimized out>) at ./build/../src/backend/executor/nodeIndexscan.c:634
#6  0x000055e766736986 in ExecReScanIndexScan (node=node@entry=0x55e767027388) at ./build/../src/backend/executor/nodeIndexscan.c:568
#7  0x000055e766705c38 in ExecReScan (node=0x55e767027388) at ./build/../src/backend/executor/execAmi.c:183
#8  0x000055e766736215 in ExecIndexScan (pstate=0x55e767027388) at ./build/../src/backend/executor/nodeIndexscan.c:530
#9  0x00007f36b77325e0 in ExecProcNode (node=0x55e767027388) at /usr/include/postgresql/14/server/executor/executor.h:257
#10 decompress_chunk_create_tuple (state=0x55e76701ac68) at ./tsl/src/nodes/decompress_chunk/exec.c:426
#11 decompress_chunk_exec (node=0x55e76701ac68) at ./tsl/src/nodes/decompress_chunk/exec.c:373
#12 0x000055e7667292fa in ExecProcNode (node=0x55e76701ac68) at ./build/../src/include/executor/executor.h:257
#13 ExecAppend (pstate=0x55e76701a038) at ./build/../src/backend/executor/nodeAppend.c:360
#14 0x000055e766740485 in ExecProcNode (node=0x55e76701a038) at ./build/../src/include/executor/executor.h:257
#15 ExecNestLoop (pstate=0x55e767019e98) at ./build/../src/backend/executor/nodeNestloop.c:109
#16 0x000055e766737fc9 in ExecProcNode (node=0x55e767019e98) at ./build/../src/include/executor/executor.h:257
#17 ExecLimit (pstate=0x55e767019ba8) at ./build/../src/backend/executor/nodeLimit.c:96
#18 0x000055e766713333 in ExecProcNode (node=0x55e767019ba8) at ./build/../src/include/executor/executor.h:257
#19 ExecutePlan (execute_once=<optimized out>, dest=0x55e767325628, direction=<optimized out>, numberTuples=0, sendTuples=<optimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x55e767019ba8, estate=0x55e7670198d8) at ./build/../src/backend/executor/execMain.c:1551
#20 standard_ExecutorRun (queryDesc=0x55e7670389e8, direction=<optimized out>, count=0, execute_once=<optimized out>) at ./build/../src/backend/executor/execMain.c:361
#21 0x000055e76688d38c in PortalRunSelect (portal=0x55e766f21458, forward=<optimized out>, count=0, dest=<optimized out>) at ./build/../src/backend/tcop/pquery.c:921
#22 0x000055e76688e71a in PortalRun (portal=0x55e766f21458, count=9223372036854775807, isTopLevel=<optimized out>, run_once=<optimized out>, dest=0x55e767325628, altdest=0x55e767325628, qc=0x7ffd497b0340) at ./build/../src/backend/tcop/pquery.c:765
#23 0x000055e76688a9a5 in exec_simple_query () at ./build/../src/backend/tcop/postgres.c:1213
#24 0x000055e76688c23b in PostgresMain () at ./build/../src/backend/tcop/postgres.c:4508
#25 0x000055e766809db6 in BackendRun (port=<optimized out>, port=<optimized out>) at ./build/../src/backend/postmaster/postmaster.c:4537
#26 BackendStartup (port=<optimized out>) at ./build/../src/backend/postmaster/postmaster.c:4259
#27 ServerLoop () at ./build/../src/backend/postmaster/postmaster.c:1745
#28 0x000055e76680ac44 in PostmasterMain (argc=5, argv=<optimized out>) at ./build/../src/backend/postmaster/postmaster.c:1417
#29 0x000055e76657c762 in main (argc=5, argv=0x55e766e6ec00) at ./build/../src/backend/main/main.c:209
Continuing.

Program terminated with signal SIGSEGV, Segmentation fault.
The program no longer exists.

Is this enough?

@kgyrtkirk (Contributor Author)

Awesome! This seems like a slightly different backtrace - I think it's key to also do the upgrade to reproduce your issue.
You may want to report this separately, as I'm not convinced that fixing this issue will also fix yours!

If you have any suggestions and/or ideas about how this could be reproduced (beyond the 2.9.3->2.10.0 upgrade step), feel free to add them!

@mickael-choisnet

Thanks,

I'll do that.
I'll try to give every detail I can think of that may pertain to the state of my database when I performed the update.

@kgyrtkirk (Contributor Author)

I've tried to narrow this down - and got to: #5458

kgyrtkirk added a commit to kgyrtkirk/timescaledb that referenced this issue Mar 17, 2023
Decompression produces records which have all the decompressed data set,
but it also retains the fields which are used internally during decompression.
These didn't cause any problem - unless an operation is being done with the
whole row - in which case all the fields which have ended up being non-null
can be a potential segfault source.

Fixes timescale#5458 timescale#5411
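
Put differently, any whole-row reference over the decompressed relation can exercise this path; a hypothetical example (the table name is assumed):

-- A whole-row Var in the target list is evaluated by ExecEvalWholeRowVar,
-- which is the frame directly above the crash in the first backtrace.
SELECT m FROM metrics m WHERE m.device_id = 1;

-- Row-valued functions exercise the same whole-row path.
SELECT row_to_json(m) FROM metrics m LIMIT 1;
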
kgyrtkirk added a commit that referenced this issue Apr 6, 2023
Decompression produces records which have all the decompressed data
set, but it also retains the fields which are used internally during
decompression.
These didn't cause any problem - unless an operation is being done
with the whole row - in which case all the fields which have ended up
being non-null can be a potential segfault source.

Fixes #5458 #5411
@sb230132 (Contributor) commented Apr 7, 2023

Closing this issue as it's fixed in #5458.

@sb230132 closed this as completed Apr 7, 2023
@kgyrtkirk (Contributor Author)

I thought it would close automagically - it seems like GitHub needs the "fixes" keyword to be repeated to do that!
Thank you @sb230132 for keeping an eye on it!

akuzm pushed a commit that referenced this issue Apr 14, 2023
Decompression produces records which have all the decompressed data
set, but it also retains the fields which are used internally during
decompression.
These didn't cause any problem - unless an operation is being done
with the whole row - in which case all the fields which have ended up
being non-null can be a potential segfault source.

Fixes #5458 #5411

cherry-picked from 975e9ca