
[Bug]: Some incorrect memory handling. #5311

Closed
Medvecrab opened this issue Feb 13, 2023 · 2 comments · Fixed by #5317

Medvecrab (Contributor) commented Feb 13, 2023

What type of bug is this?

Data corruption, Unexpected error

What subsystems and features are affected?

Other

What happened?

Greetings! While testing PostgreSQL with TimescaleDB under valgrind I've come across two cases of incorrect memory handling.

TimescaleDB version is 2.9.2, PostgreSQL version is 15.

Follow the steps from the "How can we reproduce the bug?" section.

In the valgrind log you'll see one error that occurs when creating the extension (check the log right after executing CREATE EXTENSION timescaledb;):

==00:00:00:26.667 3107== VALGRINDERROR-BEGIN
==00:00:00:26.668 3107== Uninitialised byte(s) found during client check request
==00:00:00:26.668 3107==    at 0x5D1E78: PageAddItemExtended (bufpage.c:349)
==00:00:00:26.668 3107==    by 0x2D15F6: RelationPutHeapTuple (hio.c:62)
==00:00:00:26.668 3107==    by 0x2C0F4D: heap_insert (heapam.c:2075)
==00:00:00:26.668 3107==    by 0x35CC13: CatalogTupleInsert (indexing.c:229)
==00:00:00:26.668 3107==    by 0xE7A3034: ts_catalog_insert_only (catalog.c:653)
==00:00:00:26.668 3107==    by 0xE7A305C: ts_catalog_insert (catalog.c:660)
==00:00:00:26.668 3107==    by 0xE7A309A: ts_catalog_insert_values (catalog.c:673)
==00:00:00:26.668 3107==    by 0xE7A9F84: ts_metadata_insert (metadata.c:169)
==00:00:00:26.668 3107==    by 0xE7AA1A9: get_uuid_by_key (metadata.c:218)
==00:00:00:26.668 3107==    by 0xE7AA1A9: get_uuid_by_key (metadata.c:210)
==00:00:00:26.668 3107==    by 0xE7AA1A9: ts_metadata_get_exported_uuid (metadata.c:231)
==00:00:00:26.668 3107==    by 0xE7A0346: build_telemetry_report (telemetry.c:791)
==00:00:00:26.668 3107==    by 0xE7A1D7B: ts_build_version_request (telemetry.c:989)
==00:00:00:26.668 3107==    by 0xE7A1D7B: ts_telemetry_main (telemetry.c:1080)
==00:00:00:26.668 3107==    by 0xE798B4D: ts_bgw_job_run_and_set_next_start (job.c:1302)
==00:00:00:26.668 3107==    by 0xE798B4D: ts_bgw_job_execute (job.c:1027)
==00:00:00:26.668 3107==    by 0xE798B4D: ts_bgw_job_execute (job.c:1016)
==00:00:00:26.668 3107==  Address 0xee65ca6 is 62 bytes inside a block of size 150 client-defined
==00:00:00:26.668 3107==    at 0x743329: palloc0 (mcxt.c:1124)
==00:00:00:26.668 3107==    by 0x27AE01: heap_form_tuple (heaptuple.c:1069)
==00:00:00:26.668 3107==    by 0xE7A308C: ts_catalog_insert_values (catalog.c:671)
==00:00:00:26.668 3107==    by 0xE7A9F84: ts_metadata_insert (metadata.c:169)
==00:00:00:26.668 3107==    by 0xE7AA1A9: get_uuid_by_key (metadata.c:218)
==00:00:00:26.668 3107==    by 0xE7AA1A9: get_uuid_by_key (metadata.c:210)
==00:00:00:26.668 3107==    by 0xE7AA1A9: ts_metadata_get_exported_uuid (metadata.c:231)
==00:00:00:26.668 3107==    by 0xE7A0346: build_telemetry_report (telemetry.c:791)
==00:00:00:26.668 3107==    by 0xE7A1D7B: ts_build_version_request (telemetry.c:989)
==00:00:00:26.668 3107==    by 0xE7A1D7B: ts_telemetry_main (telemetry.c:1080)
==00:00:00:26.668 3107==    by 0xE798B4D: ts_bgw_job_run_and_set_next_start (job.c:1302)
==00:00:00:26.668 3107==    by 0xE798B4D: ts_bgw_job_execute (job.c:1027)
==00:00:00:26.668 3107==    by 0xE798B4D: ts_bgw_job_execute (job.c:1016)
==00:00:00:26.668 3107==    by 0xE798F4A: ts_bgw_job_entrypoint (job.c:1179)
==00:00:00:26.668 3107==    by 0x53ED61: StartBackgroundWorker (bgworker.c:858)
==00:00:00:26.668 3107==    by 0x544A2F: do_start_bgworker (postmaster.c:5830)
==00:00:00:26.668 3107==    by 0x544A2F: maybe_start_bgworkers (postmaster.c:6054)
==00:00:00:26.668 3107==    by 0x54540C: sigusr1_handler (postmaster.c:5211)
==00:00:00:26.668 3107== 
==00:00:00:26.668 3107== VALGRINDERROR-END

To check that there really are problems after creating the extension, you can also inspect the TimescaleDB metadata with the following commands:

postgres=# select * from _timescaledb_catalog.metadata;
        key        |                value                 | include_in_telemetry
-------------------+--------------------------------------+----------------------
 uuid              | 7edf667c-17df-498c-8ca2-d87c8ea3ab75 | t
 install_timestamp | 2023-02-13 10:57:08.673816+07        | t
 exported_uuid     | 1800b593-82a6-4ce8-b5ae-fbb5624fab1d | t
(3 rows)

postgres=# create extension pageinspect;
CREATE EXTENSION
postgres=# select * from heap_page_items(get_raw_page('_timescaledb_catalog.metadata', 0)); \gx

-[ RECORD 1 ]---------------------------------------------------------------
lp          | 1
lp_off      | 8064
lp_flags    | 1
lp_len      | 126
t_xmin      | 751
t_xmax      | 0
t_field3    | 590
t_ctid      | (0,1)
t_infomask2 | 3
t_infomask  | 2306
t_hoff      | 24
t_bits      |
t_oid       |
t_data      | \x757569640000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000004b3765646636
3637632d313764662d343938632d386361322d64383763386561336162373501
-[ RECORD 2 ]---------------------------------------------------------------
lp          | 2
lp_off      | 7944
lp_flags    | 1
lp_len      | 119
t_xmin      | 751
t_xmax      | 0
t_field3    | 591
t_ctid      | (0,2)
t_infomask2 | 3
t_infomask  | 2306
t_hoff      | 24
t_bits      |
t_oid       |
t_data      | \x696e7374616c6c5f74696d657374616d7000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000003d323032332d
30322d31332031303a35373a30382e3637333831362b303701
-[ RECORD 3 ]---------------------------------------------------------------
lp          | 3
lp_off      | 7816
lp_flags    | 1
lp_len      | 126
t_xmin      | 755
t_xmax      | 0
t_field3    | 0
t_ctid      | (0,3)
t_infomask2 | 3
t_infomask  | 2306
t_hoff      | 24
t_bits      |
t_oid       |
t_data      | \x6578706f727465645f7575696400000000000000000000006822cd0400000000080000000000000010000000000000007827510900000000e0235109000000004b3138303062
3539332d383261362d346365382d623561652d66626235363234666162316401

Look closely - there's some garbage in the third record!

Why does this happen? strlcpy DOESN'T zero-pad the destination string (function ts_metadata_insert, file src/ts_catalog/metadata.c).

The first attached patch is aimed at fixing this error. We change strlcpy to strncpy, which DOES zero-pad the destination.
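
To illustrate the difference outside of TimescaleDB, here is a small standalone sketch (not code from the patch; the local demo_strlcpy below only mirrors the behaviour of PostgreSQL's src/port/strlcpy.c for demonstration purposes). It shows that after an strlcpy-style copy the bytes beyond the terminating NUL keep whatever was in the buffer before, while strncpy zero-fills the rest of the fixed-width, NAMEDATALEN-sized buffer:

#include <stdio.h>
#include <string.h>

#define NAMEDATALEN 64			/* width of a PostgreSQL Name column */

/* Local stand-in that mimics strlcpy(): copies, truncates, NUL-terminates,
 * but never touches the bytes after the terminator. */
static size_t
demo_strlcpy(char *dst, const char *src, size_t siz)
{
	size_t		srclen = strlen(src);

	if (siz > 0)
	{
		size_t		copylen = (srclen >= siz) ? siz - 1 : srclen;

		memcpy(dst, src, copylen);
		dst[copylen] = '\0';
	}
	return srclen;
}

int
main(void)
{
	char		with_strlcpy[NAMEDATALEN];
	char		with_strncpy[NAMEDATALEN];

	/* simulate pre-existing junk in the buffers */
	memset(with_strlcpy, 0xAA, sizeof(with_strlcpy));
	memset(with_strncpy, 0xAA, sizeof(with_strncpy));

	demo_strlcpy(with_strlcpy, "exported_uuid", sizeof(with_strlcpy));
	strncpy(with_strncpy, "exported_uuid", sizeof(with_strncpy));	/* zero-pads the tail */

	printf("last byte after strlcpy-like copy: 0x%02x\n",
		   (unsigned char) with_strlcpy[NAMEDATALEN - 1]);	/* prints 0xaa */
	printf("last byte after strncpy:           0x%02x\n",
		   (unsigned char) with_strncpy[NAMEDATALEN - 1]);	/* prints 0x00 */
	return 0;
}

Because the catalog key is a fixed 64-byte Name datum, all of those tail bytes are copied into the heap tuple, which is why the junk is visible in the page dump above.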

And the second error from valgrind (well, a group of similar errors):

==00:00:00:27.294 3104== VALGRINDERROR-BEGIN
==00:00:00:27.294 3104== Invalid read of size 1
==00:00:00:27.294 3104==    at 0x484ED16: strlen (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==00:00:00:27.294 3104==    by 0x6F6E9C: namehashfast (catcache.c:144)
==00:00:00:27.294 3104==    by 0x6F6A30: CatalogCacheComputeHashValue (catcache.c:297)
==00:00:00:27.294 3104==    by 0x6F7B14: SearchCatCacheInternal (catcache.c:1233)
==00:00:00:27.294 3104==    by 0x70C690: SearchSysCache (syscache.c:1169)
==00:00:00:27.294 3104==    by 0x70C690: GetSysCacheOid (syscache.c:1293)
==00:00:00:27.294 3104==    by 0x35E4CA: RangeVarGetRelidExtended (namespace.c:331)
==00:00:00:27.294 3104==    by 0xF8A74FD: ts_continuous_agg_find_by_rv (continuous_agg.c:935)
==00:00:00:27.294 3104==    by 0xF88CE80: process_grant_and_revoke (process_utility.c:1620)
==00:00:00:27.294 3104==    by 0xF88CBB0: process_ddl_command_start (process_utility.c:4214)
==00:00:00:27.294 3104==    by 0xF88CBB0: timescaledb_ddl_command_start (process_utility.c:4457)
==00:00:00:27.294 3104==    by 0x5DDB40: PortalRunUtility (pquery.c:1158)
==00:00:00:27.294 3104==    by 0x5DDC7C: PortalRunMulti (pquery.c:1315)
==00:00:00:27.294 3104==    by 0x5DE320: PortalRun (pquery.c:791)
==00:00:00:27.294 3104==  Address 0xa17c67c is in a rw- anonymous segment
==00:00:00:27.294 3104== 
==00:00:00:27.294 3104== VALGRINDERROR-END

==00:00:00:27.294 3104== VALGRINDERROR-BEGIN
==00:00:00:27.294 3104== Invalid read of size 1
==00:00:00:27.294 3104==    at 0x484ED24: strlen (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==00:00:00:27.294 3104==    by 0x6F6E9C: namehashfast (catcache.c:144)
==00:00:00:27.294 3104==    by 0x6F6A30: CatalogCacheComputeHashValue (catcache.c:297)
==00:00:00:27.294 3104==    by 0x6F7B14: SearchCatCacheInternal (catcache.c:1233)
==00:00:00:27.294 3104==    by 0x70C690: SearchSysCache (syscache.c:1169)
==00:00:00:27.294 3104==    by 0x70C690: GetSysCacheOid (syscache.c:1293)
==00:00:00:27.294 3104==    by 0x35E4CA: RangeVarGetRelidExtended (namespace.c:331)
==00:00:00:27.294 3104==    by 0xF8A74FD: ts_continuous_agg_find_by_rv (continuous_agg.c:935)
==00:00:00:27.294 3104==    by 0xF88CE80: process_grant_and_revoke (process_utility.c:1620)
==00:00:00:27.294 3104==    by 0xF88CBB0: process_ddl_command_start (process_utility.c:4214)
==00:00:00:27.294 3104==    by 0xF88CBB0: timescaledb_ddl_command_start (process_utility.c:4457)
==00:00:00:27.294 3104==    by 0x5DDB40: PortalRunUtility (pquery.c:1158)
==00:00:00:27.294 3104==    by 0x5DDC7C: PortalRunMulti (pquery.c:1315)
==00:00:00:27.294 3104==    by 0x5DE320: PortalRun (pquery.c:791)
==00:00:00:27.294 3104==  Address 0xa17c67d is in a rw- anonymous segment
==00:00:00:27.294 3104== 
==00:00:00:27.294 3104== VALGRINDERROR-END

How did we find this? Our EventTrigger hook found some garbage in relname.

Why does this happen? When you copy the address of &((Form_pg_class) GETSTRUCT(tuple))->relname in the function process_relations_in_namespace in file src/process_utility.c, you copy an address inside a shared-memory buffer that currently holds the page with the needed information. But when, inevitably, that page is evicted from the buffer and another one is loaded in its place, the pointer ends up pointing at unrelated data.

The second patch fixes those pointers so that they no longer point into already-replaced pages.
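
For readers who want the general shape of that second fix, the sketch below is illustrative only, not the actual patch, and the helper names (get_relname_unsafe, get_relname_copy) are made up. Assuming the tuple's data still points into a shared buffer, as in the case described above, it contrasts keeping a pointer into the catalog tuple with copying the relname into memory owned by the current memory context:

#include "postgres.h"

#include "access/htup_details.h"
#include "catalog/pg_class.h"

/* Problematic pattern: the returned pointer aims into the shared buffer that
 * currently holds the pg_class page; once that buffer is reused for another
 * page, the pointer silently refers to unrelated data. */
static const char *
get_relname_unsafe(HeapTuple tuple)
{
	Form_pg_class classform = (Form_pg_class) GETSTRUCT(tuple);

	return NameStr(classform->relname);
}

/* Safer pattern: copy the name into palloc'd, process-local memory while the
 * tuple is still valid, so later use does not depend on the buffer contents. */
static char *
get_relname_copy(HeapTuple tuple)
{
	Form_pg_class classform = (Form_pg_class) GETSTRUCT(tuple);

	return pstrdup(NameStr(classform->relname));
}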

memfix_1.patch
memfix_2.patch

TimescaleDB version affected

2.9.2

PostgreSQL version used

15

What operating system did you use?

Ubuntu 22.04

What installation method did you use?

Source

What platform did you run on?

On prem/Self-hosted

Relevant log output and stack trace

No response

How can we reproduce the bug?

The reproduction is very simple:

1) Configure & install PostgreSQL (use configure with 'CPPFLAGS=-DUSE_VALGRIND' to allow the use of valgrind) 

2) Configure & install timescaledb

3) Run initdb

4) In postgresql.conf add timescaledb to shared_preload_libraries

5) Start PostgreSQL under valgrind:


valgrind --leak-check=no --gen-suppressions=all --suppressions=src/tools/valgrind.supp --time-stamp=yes \
    --error-markers=VALGRINDERROR-BEGIN,VALGRINDERROR-END --log-file=/home/postgres/valgrind_log.log --trace-children=yes \
    /your/bin/dir/postgres --log_line_prefix="%m %p " --log_statement=all --shared_buffers=64MB \
    -D /your/data/dir 2>&1 | tee /your/log/dir/valgrind_postmaster.log
6) Execute these commands (with psql):

CREATE EXTENSION timescaledb;

CREATE ROLE test;

GRANT SELECT ON ALL TABLES IN SCHEMA information_schema TO test;
Medvecrab added the bug label Feb 13, 2023
kgyrtkirk (Contributor) commented:

Thank you for reporting this issue - I think I also see the same invalid read detected by valgrind; however, I find it odd that this could be caused by strlcpy, as it is supposed to take care of null termination:
https://github.com/postgres/postgres/blob/master/src/port/strlcpy.c#L38-L45

I've applied your patches and it seems to me that #2 fixes the issues reported by valgrind! :)

Could you please submit your proposed changes in a PR against the repo?
You might find CONTRIBUTING.md useful.

Medvecrab (Contributor, Author) commented:

Thanks for the quick reply!

The thing about the first patch is that I use strncpy specifically to fill the entire length of the buffer (key_data in this case) with zeroes; this fixes the problem that valgrind reports after executing just CREATE EXTENSION timescaledb;

As you've said, strlcpy deals with null termination, but the goal of the first patch is to get rid of the junk seen here (from the initial issue description), as illustrated by the sketch after the quoted record:

-[ RECORD 3 ]---------------------------------------------------------------
lp | 3
lp_off | 7816
lp_flags | 1
lp_len | 126
t_xmin | 755
t_xmax | 0
t_field3 | 0
t_ctid | (0,3)
t_infomask2 | 3
t_infomask | 2306
t_hoff | 24
t_bits |
t_oid |
t_data | \x6578706f727465645f7575696400000000000000000000006822cd0400000000080000000000000010000000000000007827510900000000e0235109000000004b3138303062
3539332d383261362d346365382d623561652d66626235363234666162316401
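
A hedged sketch of that idea (the helper metadata_key_to_datum is hypothetical, not the actual patch or a TimescaleDB API): because the catalog key is a fixed-width NameData, all NAMEDATALEN bytes of the local buffer are copied into the tuple, so the whole buffer has to be zeroed, either implicitly via strncpy or explicitly before an strlcpy.

#include "postgres.h"

/* Hypothetical helper, for illustration only: build a Name datum whose unused
 * tail bytes are guaranteed to be zero before heap_form_tuple() copies it. */
static Datum
metadata_key_to_datum(NameData *key_name, const char *key)
{
	/* zero the full NAMEDATALEN-byte buffer first ... */
	memset(key_name, 0, sizeof(NameData));
	/* ... then copy the key text; the untouched tail stays zero */
	strlcpy(NameStr(*key_name), key, NAMEDATALEN);

	return NameGetDatum(key_name);
}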

I'll make a pull request tomorrow, after reading the contributing guide.

Medvecrab pushed a commit to Medvecrab/timescaledb that referenced this issue Feb 14, 2023
While running TimescaleDB under valgrind I've found
two cases of incorrect memory handling.

Case 1: When creating timescaledb extension, during the insertion of
metadata there is some junk in memory that is not zeroed
before writing there.
Changes in metadata.c fix this.

Case 2: When executing GRANT smth ON ALL TABLES IN SCHEMA some_schema
and deconstructing this statement into granting to individual tables,
process of copying names of those tables is wrong.
Currently, you aren't copying the data itself, but an address to data
on a page in some buffer. There's a problem - when the page in this
buffer changes, copied address would lead to wrong data.
Changes in process_utility.c fix this by allocating memory and then
copying needed relname there.

Fixes timescale#5311
Medvecrab added a commit to Medvecrab/timescaledb that referenced this issue Feb 14, 2023
Medvecrab added a commit to Medvecrab/timescaledb that referenced this issue Feb 15, 2023
Medvecrab added a commit to Medvecrab/timescaledb that referenced this issue Feb 17, 2023
Medvecrab added a commit to Medvecrab/timescaledb that referenced this issue Feb 20, 2023
mkindahl pushed a commit to Medvecrab/timescaledb that referenced this issue Feb 20, 2023
mkindahl pushed a commit that referenced this issue Feb 20, 2023
github-actions bot pushed a commit that referenced this issue Mar 1, 2023 (cherry picked from commit 0746517)
svenklemm pushed a commit to svenklemm/timescaledb that referenced this issue Mar 3, 2023 (cherry picked from commit 0746517)
timescale-automation pushed a commit that referenced this issue Mar 3, 2023 (cherry picked from commit 0746517)