Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Postgres #415

Merged
merged 159 commits into from
May 6, 2023
Merged

Postgres #415

merged 159 commits into from
May 6, 2023

Conversation

pnsafonov
Copy link

PostgreSQL files decoder.
Implemented formats: pg_control, pg_heap.

@wader
Copy link
Owner

wader commented Nov 26, 2022

I removed WAL file parsing. I failed to implement it.

Ok, what part made it hard? i know for sqlite i got stuck at having to traverse a btree in the decode to know which pages are used, is possible to do but quite complicated.

@pnsafonov
Copy link
Author

, what part made it hard? i know for sqlite i got stuck at having to traverse a btree in the decode to know which pages are used, is possible to do bu

Postgres WAL is devided into 8kb pages. Each page contains header, next are records.
Record can be divided into multiple pages. Postgres reconstructs each record in separate buffer when read WAL.
WAL record is not continuous. It can split uint32: 3 bytes on current page, 1 byte on the next page (page header between 3 and 1 bytes).

It is complicated to read split uint32 (ulong64, string and etc).

@wader
Copy link
Owner

wader commented Dec 1, 2022

, what part made it hard? i know for sqlite i got stuck at having to traverse a btree in the decode to know which pages are used, is possible to do bu

Postgres WAL is devided into 8kb pages. Each page contains header, next are records. Record can be divided into multiple pages. Postgres reconstructs each record in separate buffer when read WAL. WAL record is not continuous. It can split uint32: 3 bytes on current page, 1 byte on the next page (page header between 3 and 1 bytes).

It is complicated to read split uint32 (ulong64, string and etc).

Aha i see, if i understand correctly one has to reassemble WAL records from pages? the fq decode API do support to do things like this. You can collect bytes into a bytes buffer and then decode it as a "sub" buffer. For example see the ogg decoder which read pages that contains segments of packets, the segments are demuxed into continuous packet buffers and then decoded. Other examples is is zip that decodes uncompressed data or the pcap decoder that reassemble tcp streams and decodes.

Would something like that work?

Copy link
Owner

@wader wader left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is anyone still working/using this? would be great to get it merged in some form. Maybe we can try split into a PR with basic things are work and one with things not so clear yet?

format/format.go Outdated
@@ -109,6 +109,9 @@ const (
OPUS_PACKET = "opus_packet"
PCAP = "pcap"
PCAPNG = "pcapng"
PG_BTREE = "pg_btree"
PG_CONTROL = "pg_control"
PG_HEAP = "pg_heap"
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, i did quite large refactor how formats are registered to allow some new things in the futue (probe via schema, filename etc). To make it build again change to:

	PgBtree            = &decode.Group{Name: "pg_btree"}
	PgControl          = &decode.Group{Name: "pg_control"}
	PgHeap             = &decode.Group{Name: "pg_heap"}

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's done

RootName: "pages",
})
interp.RegisterFS(pgBTreeFS)
}
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change to:

	interp.RegisterFormat(
                 format.PgBtree,
                 &decode.Format{
        		Description: "PostgreSQL btree index file",
        		DecodeFn:    decodePgBTree,
        		DecodeInArg: format.PostgresBTreeIn{
        			Page: 0,
        		},
        		RootArray: true,
        		RootName:  "pages",
        	})

Notice now that maybe PostgresBTreeIn should be renamed to PgBTreeIn to match the format name more?

},
RootArray: true,
RootName: "pages",
})
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment about pg_btree.go what to change, also same about PostgresHeapIn i guess, maybe rename to match format name

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's done

DecodeInArg: format.PostgresIn{
Flavour: "",
},
})
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment about pg_btree.go. Also wonder if PostgresIn whould be PgControlIn here? the arg struct type is usually per format and not shared even if having same members, think it makes sense to keep things decoupled if things change.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's done

@pnsafonov
Copy link
Author

Is anyone still working/using this? would be great to get it merged in some form. Maybe we can try split into a PR with basic things are work and one with things not so clear yet?

I am working on PostgreSQL decoder. I will try to make pull request ready for merge.

@wader
Copy link
Owner

wader commented May 2, 2023

@pnsafonov That would be great 👍

BTW have you been able to use this decode in practice? like debug corruption or for education etc?

@pnsafonov
Copy link
Author

BTW have you been able to use this decode in practice? like debug corruption or for education etc?

Yes, i have some experience of PostgreSQL files corruption analysis with fq. In addition it's good for visualizing data.

@wader
Copy link
Owner

wader commented May 2, 2023

@pnsafonov Ok! nice to hear

@pnsafonov
Copy link
Author

I have synchronized postgres with master. Fix conflicts.
CI is ok now, all checks are passed.

Copy link
Owner

@wader wader left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this again. Did some reviewing and will probably do some more

btw you can run make doc to update doc/formats.md with pg formats and help texts.

README.md Outdated
[pg_btree](doc/formats.md#pg_btree),
[pg_control](doc/formats.md#pg_control),
[pg_heap](doc/formats.md#pg_heap),
[pg_wal](doc/formats.md#pg_wal),
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pg_wal was removed?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, pg_wal is removed.

format/format.go Outdated
Flavour string `doc:"PostgreSQL flavour: postgres14, pgproee14.., postgres10"`
}

type PostgresHeapIn struct {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name convention i've settle on for in/out struct is <format-name>_In/Out where format name is camel case and same underscoring as the format name, so Pg_Heap_In

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's done.

format/format.go Outdated
@@ -112,6 +112,9 @@ var (
Opus_Packet = &decode.Group{Name: "opus_packet"}
PCAP = &decode.Group{Name: "pcap"}
PCAPNG = &decode.Group{Name: "pcapng"}
PG_BTREE = &decode.Group{Name: "pg_btree"}
PG_CONTROL = &decode.Group{Name: "pg_control"}
PG_HEAP = &decode.Group{Name: "pg_heap"}
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To follow reset of naming these would be Pg_Heap or PG_Heap and so on

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's done.

d.SeekAbs(0)

pgProVersion, oriVersion := common.ParsePgProVersion(uint32(pgControlVersion))

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In below switches version 15 is not needed?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Versions 13, 14, 15 have the same constant 1300. I can't detect version clearly. So I use version 14.


var pgIn format.PgControlIn
if !d.ArgAs(&pgIn) {
d.Fatalf("DecodeInArg must be PgControlIn!\n")
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe update error message, something like "no flavour specified"? btw is it possible to probe flavour if not set?

There are some other d.ArgAs cases where message need update

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's done.

pos0 := page.BytesPosBegin * 8
d.SeekAbs(pos0)

if end, _ := d.TryEnd(); end {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think d.SeekAbs will panic and abort decode if pos is invalid so maybe can just be if d.End()? or was there some reason TryEnd was used?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's done.

pos2 := d.Pos()
bytesPos2 := pos2 / 8
if bytesPos2 != page.BytesPosEnd {
d.Fatalf("invalid pos after read page_opaque_data on meta page\n")
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can skip \n in fatal end error messages

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's done.

d.FieldValueBool("has_garbage", hasGarbage)
d.FieldValueBool("is_incomplete_split", isIncompleteSplit)
d.FieldValueBool("has_full_xid", hasFullXid)
})
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Think i commend about this before but don't remember what the conclusion was: Would it make sense to decode so that there will be fields for each "real" bit or it does not make sense? i look like most flags are mapped directly to one bit but some are "derived" like isIgnore

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It useful when analysing data such as table rows or indexes. I am searching for anomalies and check every flag. I am not so good in applying masks in mind. I better understand boolean in text.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Think i maybe explain it badly, what i mean is something like:

func decodeBTPageOpaqueData(d *decode.D) {
	prev := d.FieldU32("btpo_prev")
	next := d.FieldU32("btpo_next")

	d.FieldU32("btpo_level")
	d.FieldStruct("btpo_flags", func(d *decode.D) {
		d.FieldBool("has_full_xid")
		d.FieldBool("is_incomplete_split")
		d.FieldBool("has_garbage")
		isHalfDead := d.FieldBool("is_half_dead")
		d.FieldBool("is_meta")
		isDeleted := d.FieldBool("is_deleted")
		d.FieldBool("is_root")
		d.FieldBool("is_leaf")
		d.FieldU8("unused0")

		isLeftMost := prev == P_NONE
		isRightMost := next == P_NONE
		isIgnore := isDeleted || isHalfDead

		d.FieldValueBool("is_leftmost", isLeftMost)
		d.FieldValueBool("is_rightmost", isRightMost)
		d.FieldValueBool("is_ignore", isIgnore)
	})

	d.FieldU16("btpo_cycleid")
}

The decode tree diff looks like this:

        --- expected
        +++ actual
        @@ -30,18 +30,18 @@
         0x1ff0|00 00 00 00                                    |....            |    btpo_prev: 0
         0x1ff0|            00 00 00 00                        |    ....        |    btpo_next: 0
         0x1ff0|                        00 00 00 00            |        ....    |    btpo_level: 0
        -0x1ff0|                                    08 00      |            ..  |    btpo_flags: 8
        -0x1ff0|                                          00 00|              ..|    btpo_cycleid: 0
               |                                               |                |    flags{}:
        +0x1ff0|                                    08         |            .   |      has_full_xid: false
        +0x1ff0|                                    08         |            .   |      is_incomplete_split: false
        +0x1ff0|                                    08         |            .   |      has_garbage: false
        +0x1ff0|                                    08         |            .   |      is_half_dead: false
        +0x1ff0|                                    08         |            .   |      is_meta: true
        +0x1ff0|                                    08         |            .   |      is_deleted: false
        +0x1ff0|                                    08         |            .   |      is_root: false
        +0x1ff0|                                    08         |            .   |      is_leaf: false
        +0x1ff0|                                       00      |             .  |      unused0: 0
               |                                               |                |      is_leftmost: true
               |                                               |                |      is_rightmost: true
        -      |                                               |                |      is_leaf: false
        -      |                                               |                |      is_root: false
        -      |                                               |                |      is_deleted: false
        -      |                                               |                |      is_meta: true
        -      |                                               |                |      is_half_dead: false
               |                                               |                |      is_ignore: false
        -      |                                               |                |      has_garbage: false
        -      |                                               |                |      is_incomplete_split: false
        -      |                                               |                |      has_full_xid: false
        +0x1ff0|                                          00 00|              ..|    btpo_cycleid: 0

So all fields are still there but now are boolean that map to actual bit ranges in the file except "is_ignore" which is "synthetic". This would also enable in future version of to show bit level tree somehow and possibly also syntax to change individual bits.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTP_LEAF             = 1 << 0 /* leaf page, i.e. not internal page */
BTP_ROOT             = 1 << 1 /* root page (has no parent) */
BTP_DELETED          = 1 << 2 /* page has been deleted from tree */
BTP_META             = 1 << 3 /* meta-page */
BTP_HALF_DEAD        = 1 << 4 /* empty, but still in tree */
BTP_SPLIT_END        = 1 << 5 /* rightmost page of split group */
BTP_HAS_GARBAGE      = 1 << 6 /* page has LP_DEAD tuples (deprecated) */
BTP_INCOMPLETE_SPLIT = 1 << 7 /* right sibling's downlink is missing */
BTP_HAS_FULLXID      = 1 << 8 /* contains BTDeletedPageData */

Bits in uin16 on LE:
7 - 0 15 - 8

Read each mask of uint16 on little endian by using bitstream is complex to undestand:

d.FieldStruct("btpo_flags", func(d *decode.D) {
	d.FieldBool("is_incomplete_split")
	d.FieldBool("has_garbage")
	d.FieldBool("split_end")
	isHalfDead := d.FieldBool("is_half_dead")
	d.FieldBool("is_meta")
	isDeleted := d.FieldBool("is_deleted")
	d.FieldBool("is_root")
	d.FieldBool("is_leaf")

	d.FieldU7("skip1")
	d.FieldBool("has_full_xid")

	d.FieldValueBool("is_ignore", isDeleted || isHalfDead)
})

Order of fields is not intuitive and is hard coded for little endian.

Postgres in source uses mask to read this fields:
#define P_ISLEAF(opaque) (((opaque)->btpo_flags & BTP_LEAF) != 0)
#define P_ISROOT(opaque) (((opaque)->btpo_flags & BTP_ROOT) != 0)

But more than 99.9% of machines uses now little endian. And there is not reason to not hardcode.

Better approach would be not to use masks, but to use bitstream (FieldBool)?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either way works for me, think i let you decide what is best, and we can always change it later if tracking bit rangres per flag is useful etc.

I've thought about adding some kind of helper to the decode API to make it more convenient to produce FieldBool and FieldU/S etc from an endian "aware" bitstream decoder somehow, maybe something similar to how kaitai struct does "bit-endian" (https://doc.kaitai.io/user_guide.html#_bit_sized_integers see "4.14.2. Little-endian order").

But generally i'm fine with having the decode code being a bit ugly or weird as long it produces a nicer decode output to the user but maybe that is not the case here.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed to bitstream (FieldBool).

blockNumber := uint32(heap.Args.Page + heap.Args.Segment*common.RelSegSize)
count := int64(0)
for {
if end, _ := d.TryEnd(); end {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as the other comment about SeekAbs/TryEnd, can be just End()?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's done.

}

type BTree struct {
Args format.Pg_BTree_In
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like only Args.Page is used, could possibly decouple a bit by accepting page int as arg for DecodePgBTree

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this can be refactor like this:

diff --git a/format/postgres/common/pg_btree/postgres/pg_btree.go b/format/postgres/common/pg_btree/postgres/pg_btree.go
index 5ae586c5..be387438 100644
--- a/format/postgres/common/pg_btree/postgres/pg_btree.go
+++ b/format/postgres/common/pg_btree/postgres/pg_btree.go
@@ -1,7 +1,6 @@
 package postgres
 
 import (
-	"github.com/wader/fq/format"
 	"github.com/wader/fq/format/postgres/common"
 	"github.com/wader/fq/format/postgres/common/pg_heap/postgres"
 	"github.com/wader/fq/pkg/decode"
@@ -65,30 +64,17 @@ const (
 // IndexTupleData *IndexTuple;
 /* total size (bytes):    8 */
 
-func DecodePgBTree(d *decode.D, args format.Pg_BTree_In) any {
-	btree := &BTree{
-		Args:     args,
-		PageSize: common.PageSize,
-	}
-	decodeBTreePages(btree, d)
-	return nil
-}
+func DecodePgBTree(d *decode.D, pageNr int) {
+	var prevPage *postgres.HeapPage
 
-type BTree struct {
-	Args     format.Pg_BTree_In
-	PageSize uint64
-	page     *postgres.HeapPage
-}
-
-func decodeBTreePages(btree *BTree, d *decode.D) {
-	for i := btree.Args.Page; ; i++ {
+	for i := pageNr; ; i++ {
 		page := &postgres.HeapPage{}
-		if btree.page != nil {
+		if prevPage != nil {
 			// use prev page
-			page.BytesPosBegin = btree.page.BytesPosEnd
+			page.BytesPosBegin = prevPage.BytesPosEnd
 		}
-		page.BytesPosEnd = int64(common.TypeAlign(btree.PageSize, uint64(page.BytesPosBegin)+1))
-		btree.page = page
+		page.BytesPosEnd = int64(common.TypeAlign(common.PageSize, uint64(page.BytesPosBegin)+1))
+		prevPage = page
 
 		pos0 := page.BytesPosBegin * 8
 		d.SeekAbs(pos0)
@@ -100,19 +86,18 @@ func decodeBTreePages(btree *BTree, d *decode.D) {
 		if i == 0 {
 			// first page contains meta information
 			d.FieldStruct("page", func(d *decode.D) {
-				decodeBTreeMetaPage(btree, d)
+				decodeBTreeMetaPage(page, d)
 			})
 			continue
 		}
 
 		d.FieldStruct("page", func(d *decode.D) {
-			decodeBTreePage(btree, d)
+			decodeBTreePage(page, d)
 		})
 	}
 }
 
-func decodeBTreeMetaPage(btree *BTree, d *decode.D) {
-	page := btree.page
+func decodeBTreeMetaPage(page *postgres.HeapPage, d *decode.D) {
 
 	d.FieldStruct("page_header", func(d *decode.D) {
 		postgres.DecodePageHeader(page, d)
@@ -122,7 +107,7 @@ func decodeBTreeMetaPage(btree *BTree, d *decode.D) {
 	})
 
 	pos0 := d.Pos()
-	pos1 := btree.page.BytesPosSpecial * 8
+	pos1 := page.BytesPosSpecial * 8
 	d.FieldRawLen("unused0", pos1-pos0)
 	d.FieldStruct("page_opaque_data", func(d *decode.D) {
 		decodeBTPageOpaqueData(d)
@@ -204,15 +189,13 @@ func decodeBTPageOpaqueData(d *decode.D) {
 	})
 }
 
-func decodeBTreePage(btree *BTree, d *decode.D) {
-	page := btree.page
-
+func decodeBTreePage(page *postgres.HeapPage, d *decode.D) {
 	d.FieldStruct("page_header", func(d *decode.D) {
 		postgres.DecodePageHeader(page, d)
 	})
 
 	pos0 := d.Pos()
-	pos1 := btree.page.BytesPosSpecial * 8
+	pos1 := page.BytesPosSpecial * 8
 	d.SeekAbs(pos1)
 	d.FieldStruct("page_opaque_data", func(d *decode.D) {
 		decodeBTPageOpaqueData(d)
@@ -227,12 +210,11 @@ func decodeBTreePage(btree *BTree, d *decode.D) {
 	postgres.DecodeItemIds(page, d)
 
 	d.FieldArray("tuples", func(d *decode.D) {
-		decodeIndexTuples(btree, d)
+		decodeIndexTuples(page, d)
 	})
 }
 
-func decodeIndexTuples(btree *BTree, d *decode.D) {
-	page := btree.page
+func decodeIndexTuples(page *postgres.HeapPage, d *decode.D) {
 
 	for i := 0; i < len(page.ItemIds); i++ {
 		id := page.ItemIds[i]
diff --git a/format/postgres/pg_btree.go b/format/postgres/pg_btree.go
index 5a53aa42..9df49ac4 100644
--- a/format/postgres/pg_btree.go
+++ b/format/postgres/pg_btree.go
@@ -32,5 +32,6 @@ func decodePgBTree(d *decode.D) any {
 	if !d.ArgAs(&pgIn) {
 		d.Fatalf("no page specified")
 	}
-	return postgres.DecodePgBTree(d, pgIn)
+	postgres.DecodePgBTree(d, pgIn.Page)
+	return nil
 }

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Patch applied.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 maybe an older version had more things in BTree?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, nothing to add.

```

### References
- https://www.postgresql.org/docs/current/storage-page-layout.html
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you want you can add a authors section to all the *.md files, see bson.md etc (and do make doc afterwards)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's done.

@wader
Copy link
Owner

wader commented May 5, 2023

I think it's looking good. Is there something more you would like to add or change?

I might want to do some changes but i think it's going to be easier to merge this and then i do PR:s and ping you to approve and feedback.

@pnsafonov
Copy link
Author

I think it's looking good. Is there something more you would like to add or change?

I might want to do some changes but i think it's going to be easier to merge this and then i do PR:s and ping you to approve and feedback.

I have nothing to add now. I think it's ready for merge.

@wader
Copy link
Owner

wader commented May 6, 2023

👍 Great!

@wader wader merged commit a200d3e into wader:master May 6, 2023
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants