Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Apple bookmarkData #493

Merged
merged 47 commits into from
Dec 10, 2022
Merged

Apple bookmarkData #493

merged 47 commits into from
Dec 10, 2022

Conversation

dgmcdona
Copy link
Contributor

This PR implements a decoder for macOS/iOS bookmarkData blobs, which are often found within binary plist files. These are used to resolve URL objects for a file, even if the user moves or renames it.

There are some issues that I could use help ironing out, namely getting it to work properly as a nested format using grep_by or select in recursive jq expressions. I think there may be something wrong with the way torepr and tobytes are working in the bplist decoder. For instance this does not work in the way that I would think it would (does not yield any results):

fq '.. | select(format=="bookmark") | .map(. | torepr)' com.apple.LSSharedFileList.RecentApplications.sfl2

In this case, when a bplist data type is passed to tobytes, the bytes output are the truncated base64 representation:

./fq 'torepr."$objects"[15] | tobytes | bookmark' ~/Library/Application\ Support/com.apple.sharedfilelist/com.apple.LSSharedFileList.RecentApplications.sfl2

@wader
Copy link
Owner

wader commented Nov 27, 2022

This PR implements a decoder for macOS/iOS bookmarkData blobs, which are often found within binary plist files. These are used to resolve URL objects for a file, even if the user moves or renames it.

Nice!

There are some issues that I could use help ironing out, namely getting it to work properly as a nested format using grep_by or select in recursive jq expressions. I think there may be something wrong with the way torepr and tobytes are working in the bplist decoder. For instance this does not work in the way that I would think it would (does not yield any results):

fq '.. | select(format=="bookmark") | .map(. | torepr)' com.apple.LSSharedFileList.RecentApplications.sfl2

Aha yes format is fq specific function that checks if a value is a decode "root" and if so return a string of the name of the decode, so in short grep_by(format=="jpeg")) would find all nested jpeg root values.

Had a quick look, maybe something like this would work?

fq 'grep_by(.type=="data" and .value[0:4] == "book") | bookmark | torepr' ...

In this case, when a bplist data type is passed to tobytes, the bytes output are the truncated base64 representation:

./fq 'torepr."$objects"[15] | tobytes | bookmark' ~/Library/Application\ Support/com.apple.sharedfilelist/com.apple.LSSharedFileList.RecentApplications.sfl2

I think the problem might be that by default fq currently turns "raw" fields into truncated "" strings when represented as a jq value, jq has no binary type unfortunately. Maybe have to rethink that and also i think maybe gojq had some changes that might have made strings "binary safe".

But there is an option to change how raw fields should be represented, try fq -o bits_format=string torepr ..., valid values are md5, base64, truncate, string and snippet (defualt). Maybe bits_format shoud be renamed to raw_format?

Nice work, i will review more tomorrow!

format/format.go Outdated Show resolved Hide resolved
format/bookmark/bookmark.go Outdated Show resolved Hide resolved
format/bookmark/bookmark.go Outdated Show resolved Hide resolved
format/bookmark/bookmark.go Outdated Show resolved Hide resolved
format/bookmark/bookmark.md Outdated Show resolved Hide resolved
format/bookmark/bookmark.md Outdated Show resolved Hide resolved
@wader
Copy link
Owner

wader commented Nov 28, 2022

Had look at binary safe strings in jq and gojq. jq is not safe (replaces invalid code points with "REPLACEMENT CHARACTER") but gojq is. Seems i remembered it wrong, i think maybe i confused with gojq removed support for \x## escapes.

$ fq -n '[range(256)] | tobytes' | gojq -jrRs . | hexdump -C
00000000  00 01 02 03 04 05 06 07  08 09 0a 0b 0c 0d 0e 0f  |................|
00000010  10 11 12 13 14 15 16 17  18 19 1a 1b 1c 1d 1e 1f  |................|
00000020  20 21 22 23 24 25 26 27  28 29 2a 2b 2c 2d 2e 2f  | !"#$%&'()*+,-./|
00000030  30 31 32 33 34 35 36 37  38 39 3a 3b 3c 3d 3e 3f  |0123456789:;<=>?|
00000040  40 41 42 43 44 45 46 47  48 49 4a 4b 4c 4d 4e 4f  |@ABCDEFGHIJKLMNO|
00000050  50 51 52 53 54 55 56 57  58 59 5a 5b 5c 5d 5e 5f  |PQRSTUVWXYZ[\]^_|
00000060  60 61 62 63 64 65 66 67  68 69 6a 6b 6c 6d 6e 6f  |`abcdefghijklmno|
00000070  70 71 72 73 74 75 76 77  78 79 7a 7b 7c 7d 7e 7f  |pqrstuvwxyz{|}~.|
00000080  80 81 82 83 84 85 86 87  88 89 8a 8b 8c 8d 8e 8f  |................|
00000090  90 91 92 93 94 95 96 97  98 99 9a 9b 9c 9d 9e 9f  |................|
000000a0  a0 a1 a2 a3 a4 a5 a6 a7  a8 a9 aa ab ac ad ae af  |................|
000000b0  b0 b1 b2 b3 b4 b5 b6 b7  b8 b9 ba bb bc bd be bf  |................|
000000c0  c0 c1 c2 c3 c4 c5 c6 c7  c8 c9 ca cb cc cd ce cf  |................|
000000d0  d0 d1 d2 d3 d4 d5 d6 d7  d8 d9 da db dc dd de df  |................|
000000e0  e0 e1 e2 e3 e4 e5 e6 e7  e8 e9 ea eb ec ed ee ef  |................|
000000f0  f0 f1 f2 f3 f4 f5 f6 f7  f8 f9 fa fb fc fd fe ff  |................|
00000100
$ fq -n '[range(256)] | tobytes' | jq -jrRs . | hexdump -C
00000000  00 01 02 03 04 05 06 07  08 09 0a 0b 0c 0d 0e 0f  |................|
00000010  10 11 12 13 14 15 16 17  18 19 1a 1b 1c 1d 1e 1f  |................|
00000020  20 21 22 23 24 25 26 27  28 29 2a 2b 2c 2d 2e 2f  | !"#$%&'()*+,-./|
00000030  30 31 32 33 34 35 36 37  38 39 3a 3b 3c 3d 3e 3f  |0123456789:;<=>?|
00000040  40 41 42 43 44 45 46 47  48 49 4a 4b 4c 4d 4e 4f  |@ABCDEFGHIJKLMNO|
00000050  50 51 52 53 54 55 56 57  58 59 5a 5b 5c 5d 5e 5f  |PQRSTUVWXYZ[\]^_|
00000060  60 61 62 63 64 65 66 67  68 69 6a 6b 6c 6d 6e 6f  |`abcdefghijklmno|
00000070  70 71 72 73 74 75 76 77  78 79 7a 7b 7c 7d 7e 7f  |pqrstuvwxyz{|}~.|
00000080  ef bf bd ef bf bd ef bf  bd ef bf bd ef bf bd ef  |................|
00000090  bf bd ef bf bd ef bf bd  ef bf bd ef bf bd ef bf  |................|
000000a0  bd ef bf bd ef bf bd ef  bf bd ef bf bd ef bf bd  |................|
000000b0  ef bf bd ef bf bd ef bf  bd ef bf bd ef bf bd ef  |................|
000000c0  bf bd ef bf bd ef bf bd  ef bf bd ef bf bd ef bf  |................|
000000d0  bd ef bf bd ef bf bd ef  bf bd ef bf bd ef bf bd  |................|
000000e0  ef bf bd ef bf bd ef bf  bd ef bf bd ef bf bd ef  |................|
000000f0  bf bd ef bf bd ef bf bd  ef bf bd ef bf bd ef bf  |................|
00000100  bd ef bf bd ef bf bd ef  bf bd ef bf bd ef bf bd  |................|
00000110  ef bf bd ef bf bd ef bf  bd ef bf bd ef bf bd ef  |................|
00000120  bf bd ef bf bd ef bf bd  ef bf bd ef bf bd ef bf  |................|
00000130  bd ef bf bd ef bf bd ef  bf bd ef bf bd ef bf bd  |................|
00000140  ef bf bd ef bf bd ef bf  bd ef bf bd ef bf bd ef  |................|
00000150  bf bd ef bf bd ef bf bd  ef bf bd ef bf bd ef bf  |................|
00000160  bd ef bf bd ef bf bd ef  bf bd ef bf bd ef bf bd  |................|
00000170  ef bf bd ef bf bd ef bf  bd ef bf bd ef bf bd ef  |................|
00000180  bf bd ef bf bd ef bf bd  ef bf bd ef bf bd ef bf  |................|
00000190  bd ef bf bd ef bf bd ef  bf bd ef bf bd ef bf bd  |................|
000001a0  ef bf bd ef bf bd ef bf  bd ef bf bd ef bf bd ef  |................|
000001b0  bf bd ef bf bd ef bf bd  ef bf bd ef bf bd ef bf  |................|
000001c0  bd ef bf bd ef bf bd ef  bf bd ef bf bd ef bf bd  |................|
000001d0  ef bf bd ef bf bd ef bf  bd ef bf bd ef bf bd ef  |................|
000001e0  bf bd ef bf bd ef bf bd  ef bf bd ef bf bd ef bf  |................|
000001f0  bd ef bf bd ef bf bd ef  bf bd ef bf bd ef bf bd  |................|
00000200

But maybe as expected in gojq the bytes will not survive a JSON run-trip, could maybe be worked around in fq but then would not be standard compliant i guess.

fq -n '[range(256)] | tobytes' | gojq -jrRs 'tojson | fromjson' | hexdump -C
00000000  00 01 02 03 04 05 06 07  08 09 0a 0b 0c 0d 0e 0f  |................|
00000010  10 11 12 13 14 15 16 17  18 19 1a 1b 1c 1d 1e 1f  |................|
00000020  20 21 22 23 24 25 26 27  28 29 2a 2b 2c 2d 2e 2f  | !"#$%&'()*+,-./|
00000030  30 31 32 33 34 35 36 37  38 39 3a 3b 3c 3d 3e 3f  |0123456789:;<=>?|
00000040  40 41 42 43 44 45 46 47  48 49 4a 4b 4c 4d 4e 4f  |@ABCDEFGHIJKLMNO|
00000050  50 51 52 53 54 55 56 57  58 59 5a 5b 5c 5d 5e 5f  |PQRSTUVWXYZ[\]^_|
00000060  60 61 62 63 64 65 66 67  68 69 6a 6b 6c 6d 6e 6f  |`abcdefghijklmno|
00000070  70 71 72 73 74 75 76 77  78 79 7a 7b 7c 7d 7e 7f  |pqrstuvwxyz{|}~.|
00000080  ef bf bd ef bf bd ef bf  bd ef bf bd ef bf bd ef  |................|
00000090  bf bd ef bf bd ef bf bd  ef bf bd ef bf bd ef bf  |................|
000000a0  bd ef bf bd ef bf bd ef  bf bd ef bf bd ef bf bd  |................|
000000b0  ef bf bd ef bf bd ef bf  bd ef bf bd ef bf bd ef  |................|
000000c0  bf bd ef bf bd ef bf bd  ef bf bd ef bf bd ef bf  |................|
000000d0  bd ef bf bd ef bf bd ef  bf bd ef bf bd ef bf bd  |................|
000000e0  ef bf bd ef bf bd ef bf  bd ef bf bd ef bf bd ef  |................|
000000f0  bf bd ef bf bd ef bf bd  ef bf bd ef bf bd ef bf  |................|
00000100  bd ef bf bd ef bf bd ef  bf bd ef bf bd ef bf bd  |................|
00000110  ef bf bd ef bf bd ef bf  bd ef bf bd ef bf bd ef  |................|
00000120  bf bd ef bf bd ef bf bd  ef bf bd ef bf bd ef bf  |................|
00000130  bd ef bf bd ef bf bd ef  bf bd ef bf bd ef bf bd  |................|
00000140  ef bf bd ef bf bd ef bf  bd ef bf bd ef bf bd ef  |................|
00000150  bf bd ef bf bd ef bf bd  ef bf bd ef bf bd ef bf  |................|
00000160  bd ef bf bd ef bf bd ef  bf bd ef bf bd ef bf bd  |................|
00000170  ef bf bd ef bf bd ef bf  bd ef bf bd ef bf bd ef  |................|
00000180  bf bd ef bf bd ef bf bd  ef bf bd ef bf bd ef bf  |................|
00000190  bd ef bf bd ef bf bd ef  bf bd ef bf bd ef bf bd  |................|
000001a0  ef bf bd ef bf bd ef bf  bd ef bf bd ef bf bd ef  |................|
000001b0  bf bd ef bf bd ef bf bd  ef bf bd ef bf bd ef bf  |................|
000001c0  bd ef bf bd ef bf bd ef  bf bd ef bf bd ef bf bd  |................|
000001d0  ef bf bd ef bf bd ef bf  bd ef bf bd ef bf bd ef  |................|
000001e0  bf bd ef bf bd ef bf bd  ef bf bd ef bf bd ef bf  |................|
000001f0  bd ef bf bd ef bf bd ef  bf bd ef bf bd ef bf bd  |................|
00000200

Also gojq uses go == operator to compare string and the golang spec says String values are comparable and ordered, lexically byte-wise so that should be fine also.

So i guess the only reason to still truncate would be that some formats would in common cases produce very big strings for some raw fields (ex mdat in mp4) when convert to JSON. Note that even with no truncate and use string instead of base64 torepr as JSON output would still produce JSON that would not be keeping some bytes (to JSON run-trip issue), it's only binary safe "internally" in fq, ex when use a query to find bookmark blob strings and decode them.

@wader
Copy link
Owner

wader commented Nov 28, 2022

Had a look at the NSKeyedArchiver thingy, seems to represent an object graph? would be cool if we could reconstruct it somehow? i gave it a try but think i might found out where the unknown fields come from. When i look at some of of sfl2 files they have a "root" uid that is 0 but i think they should be 1. Could it be that uid types are decode to short somehow? have a look at this:

$ go run . -o line_bytes=10 '(grep_by(.value=="$top") | parent), .unknown0 | d' '/Users/wader//Library/Application Support/com.apple.sharedfilelist/com.apple.LSSharedFileList.RecentApplications.sfl2'
    │00 01 02 03 04 05 06 07 08 09│0123456789│.objects.entries[2]{}: entry
0x0a│   03                        │ .        │  key_index: 3
0x0a│               07            │     .    │  value_index: 7
    │                             │          │  key{}:
0x1e│                  54         │      T   │    type: "ascii_string" (5) (ASCII encoded string)
0x1e│                  54         │      T   │    size_bits: 4
    │                             │          │    size: 4
0x1e│                     24 74 6f│       $to│    value: "$top"
0x28│70                           │p         │
    │                             │          │  value{}:
0x46│         d1                  │   .      │    type: "dict" (13) (Dictionary)
0x46│         d1                  │   .      │    size_bits: 1
    │                             │          │    size: 1
    │                             │          │    entries[0:1]:
    │                             │          │      [0]{}: entry
0x46│            08               │    .     │        key_index: 8
0x46│               09            │     .    │        value_index: 9
    │                             │          │        key{}:
0x46│                  54         │      T   │          type: "ascii_string" (5) (ASCII encoded string)
0x46│                  54         │      T   │          size_bits: 4
    │                             │          │          size: 4
0x46│                     72 6f 6f│       roo│          value: "root"
0x50│74                           │t         │
    │                             │          │        value{}:
0x50│   80                        │ .        │          type: "uid" (8) (UID)
0x50│   80                        │ .        │          size_bits: 0
    │                             │          │          size: 0
    │                             │          │          value: 0
    │00 01 02 03 04 05 06 07 08 09│0123456789│
0x50│      01                     │  .       │.unknown0: raw bits

Should the 0x01 be part of the uid value?

@dgmcdona
Copy link
Contributor Author

dgmcdona commented Nov 29, 2022

You're right, UID was reading short. I fixed that so that it reads correctly now. I'm still seeing bit ranges unaccounted for, but a side-by-side comparison of the current torepr display and the output of plutil -p doesn't show any discrepancies or missing data.

@dgmcdona
Copy link
Contributor Author

I'm confused about why this won't work:

fq 'grep_by(format="apple_bookmark")' test.plist

But this will:

fq 'grep_by(format=="jpeg")' some_file.bin

Does fq automatically probe raw bit fields when they are encountered? Is this how format works?

@dgmcdona
Copy link
Contributor Author

dgmcdona commented Nov 29, 2022

image

`fq -h apple_bookmarkdata` seems to be producing some strange effects with line breaks. Is this something that is being caused by my editor? I noticed that the format markdown docs for other formats are much more succinct, should I remove all of the explanation of the format and maybe just leave them as references?

format/applebookmark/apple_bookmark.jq Outdated Show resolved Hide resolved
format/applebookmark/apple_bookmark.go Show resolved Hide resolved
format/applebookmark/apple_bookmark.go Outdated Show resolved Hide resolved
format/applebookmark/apple_bookmark.md Outdated Show resolved Hide resolved
@wader
Copy link
Owner

wader commented Nov 29, 2022

fq -h apple_bookmarkdata seems to be producing some strange effects with line breaks. Is this something that is being caused by my editor? I noticed that the format markdown docs for other formats are much more succinct, should I remove all of the explanation of the format and maybe just leave them as references?

Hmm that is strange, i think your markdown should be ok. There is some line break code that is a bit shaky, i suspect that might be the reason (in markdown.jq), will investigate.

@wader
Copy link
Owner

wader commented Nov 29, 2022

You're right, UID was reading short. I fixed that so that it reads correctly now. I'm still seeing bit ranges unaccounted for, but a side-by-side comparison of the current torepr display and the output of plutil -p doesn't show any discrepancies or missing data.

👍 will have a look again also and see if i can figure something out, but i have run into formats that do have gaps that are not "reachable" or not even padding/alignment it seems

@wader
Copy link
Owner

wader commented Nov 29, 2022

I'm confused about why this won't work:

fq 'grep_by(format="apple_bookmark")' test.plist

format is a function in fq that takes no argument (so you leave out the ()) and it returns name of the format if possible, an example probably makes it easier to explain:

$ fq format format/bplist/testdata/Info.plist
"bplist"
$ fq '.header | format' format/bplist/testdata/Info.plist
null
$ fq -n '123 | format'
null

So it only return strings for decode value "roots", this way it can be used to find all nested formats etc.

As a side note, there happens to exist a format/1 function in jq that can be used to encode strings in different format (html, urlencode etc). Didn't even know before i named the fq format/0, is not documented. But jq allows function name overloading so they do not collide.

Back to the query:

So let's break down grep_by(format="apple_bookmark"):

format="apple_bookmark" will be a lambda expressed passed to grep_by, grep_by is a shorthand for .. | select(<some expression here>)? you can see the definition in grep.jq (also .. is shorthand for recurse).

So grep_by will recurse and evaluate format="apple_bookmark for each value, and if it evaluates to something true-ish and not throw error it will output the value (is done with select(...)?), in summary it's a recursive filter that ignored errors. format="apple_bookmark is an assignment so it will eval each side and then assign, LHS format will eval to a string or some format name, RHS just a literal string, so you will get null="apple_bookmark" which is just "apple_bookmark" which is true-ish. End result i think is that you will see all values separately except if format returns a string then i think it will throw an error that will be ignored.


But this will:

fq 'grep_by(format=="jpeg")' some_file.bin


Does `fq` automatically probe raw bit fields when they are encountered? Is this how `format` works?

some_file.bin will be probed and if successful grep_by will output all nested jpeg:s that can be found as == will compare what format outputs with "jpeg".

Hope that helped. I'm thinking about writing some kind of jq guide, so good to practice explaining how jq works :)

So for sfl2 files i think we have to figure out some good way to find bookmarks blobs. I was thinking maybe it's poossible to reconstruct the NSKeyedArchiver object tree (assume it's not a cyclic graph?) and then look for boomarks and decode things?

@wader
Copy link
Owner

wader commented Nov 29, 2022

Fixed the help word breaking a bit #497. Now looks ok i think, but maybe can move large links inside the text to references? they are treated liked a very wide word :)

@dgmcdona
Copy link
Contributor Author

dgmcdona commented Nov 30, 2022

You're explanation was very helpful! Unfortunately my question was malformed (I meant to have a second = in there there to check for equality, not do assignment).

But after thinking some more about things, I think I understand why grep_by(format=="apple_bookmark") does not work in this instance: the bplist decoder package does not explicitly invoke the apple_bookmark decoder, so it has no idea that a bookmark decode root is there; format does not make that determination at runtime. Am I thinking about that correctly?

@wader
Copy link
Owner

wader commented Nov 30, 2022

Looked at the bplist unknown fields a bit again. will answer the other things later, short lunch break now :), but think we're making good progress!

I observed this pattern, some array or object that is skipped or been lost somehow?
Screenshot 2022-11-30 at 13 46 52

@wader
Copy link
Owner

wader commented Nov 30, 2022

But after thinking some more about things, I think I understand why grep_by(format=="apple_bookmark") does not work in this instance: the bplist decoder package does not explicitly invoke the apple_bookmark decoder, so it has no idea that a bookmark decode root is there; format does not make that determination at runtime. Am I thinking about that correctly?

Correct! and i guess it's tricky to make the bplist decoder somehow be able to detect "subformats" like bootmarks, even tricker when its yet another encoding like NSKeyedArchive. So doing it with some jq help function or some snippet if short enough make sense i think.

@wader
Copy link
Owner

wader commented Nov 30, 2022

Looks good overall i think, some things you want to add? the jq snippet code things?

@dgmcdona
Copy link
Contributor Author

dgmcdona commented Dec 1, 2022

Found definitions for the individual bit fields so I added decoding for that. Those are tricky but I think they are being decoded correctly. It's difficult to test and verify. Not quite ready to move forward until I'm sure those are correct, I'm going to take another look tomorrow.

@wader
Copy link
Owner

wader commented Dec 1, 2022

A bit unrelated to this PR but what are some similar tools to fq for doing forensics? think i would be interesting to see how they work

@dgmcdona
Copy link
Contributor Author

dgmcdona commented Dec 2, 2022

https://github.com/ydkhatri/mac_apt
https://ericzimmerman.github.io/#!index.md
https://github.com/simsong/bulk_extractor
https://github.com/sleuthkit/sleuthkit

These are a few. Some have a different mission than fq in that they scan entire live systems for artifacts and collect them, but fundamentally they still have to parse the binary formats at some point. I think this PR actually improves upon the bookmarkData parsing in mac_apt. I have been working on lnk files in a separate branch, and I'm thinking Windows Prefetch files would be the next goal after that. That might be tricky due to compression but I know fq does gzip so I'm sure there's a way to handle it.

The fact that fq can be used cross-platform as a standalone executable sets it apart from many of the tools that use Python, which requires managing dependencies/environments, or the ones that require .NET. Plus, it's fast. I'm not a forensics practitioner so I don't know how much ground needs to be covered for this to become a go-to part of the standard toolset, but I want to just keep adding things and see if people want to use it.

@dgmcdona
Copy link
Contributor Author

dgmcdona commented Dec 6, 2022

Any suggestions for a good workflow for when I need to regenerate fqtest output after decoder changes? It's a bit tedious to go through, delete the old stuff, and rerun them individually; I suspect you might have a better way to do that.


type stack []int64

var offsetStack stack
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed a slice bounds out of range [:-1] panic in pop when i ran the tests and could reproduce with go test -v -count=1 -run TestFormats/applebookmark ./format. I think the issue is that tests run in parallell and offsetStack is modified and shared. Move var into bookmarkDecode and pass around?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I don't know what I was thinking making that global. Also a problem in pushAndPop, the method references the global variable instead of the struct pointer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After some thought the cleanest way to do things was to make the stack internal to the decode.D. Maybe this is good going forward? If so, can fix up the bplist decoder to use it as well since it does something similar.

Copy link
Owner

@wader wader Dec 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I don't know what I was thinking making that global. Also a problem in pushAndPop, the method references the global variable instead of the struct pointer.

No worries. Im surprised it was not detect by the test running with -race hmm

After some thought the cleanest way to do things was to make the stack internal to the decode.D. Maybe this is good going forward? If so, can fix up the bplist decoder to use it as well since it does something similar.

I'm a bit reluctant to add things to *decode.D, need to think a bit more about it if so. I guess that field would have to get inherited down into sub decoders? could we go with a separate type for this for now? maybe we could put it in the decode package if it should be shared? another option is to put bplist and apple_bookmark into formart/apple etc so they can share things?

What do you think about something like this:

type posLoopDetector []int64

// PushAndPop attempts to push a value to the stack of saved offsets, invoking
// d.Fatalf if the offset is already present on the stack. Returns the Pop
// function, useful for invocation with defer. Intended to be used for
// detecting and short circuiting infinite recursion.
func (s *posLoopDetector) PushAndPop(p int64, detect func()) func() {
	s.Push(p, detect)
	return s.Pop
}

// Push attempts to add an offset to the offset stack, invoking d.Fatalf if the
// offset is already present on the stack. Intended as a means of detecting
// infinite recursion.
func (s *posLoopDetector) Push(p int64, detect func()) {
	for _, o := range *s {
		if p == o {
			detect()
		}
	}
	*s = append(*s, p)
}

// ...

func makeDecodeRecord() func(d *decode.D) {
	var pld posLoopDetector

	var decodeRecord func(d *decode.D)
	decodeRecord = func(d *decode.D) {
		defer pld.PushAndPop(
			d.Pos(),
			func() { d.Fatalf("infinite record loop detected") },
		)()

		d.FieldStruct("record", func(d *decode.D) {
			n := int(d.FieldU32("length"))
			typ := d.FieldU32("type", dataTypeMap)
			switch typ {
			case dataTypeString:
				d.FieldUTF8("data", n)
				d.FieldRawLen("alignment_bytes", 32-((d.Pos()+32)%32))
			case dataTypeData:
				d.FieldRawLen("data", int64(n*8))
			case dataTypeNumber8:
				d.FieldS8("data")
			case dataTypeNumber16:
				d.FieldS16("data")
			case dataTypeNumber32:
				d.FieldS32("data")
			case dataTypeNumber64:
				d.FieldS64("data")
			case dataTypeNumber32F:
				d.FieldF32("data")
			case dataTypeNumber64F:
				d.FieldF64("data")
			case dataTypeDate:
				d.FieldF64BE("data", scalar.DescriptionTimeFn(scalar.S.TryActualF, cocoaTimeEpochDate, time.RFC3339))
			case dataTypeBooleanFalse:
			case dataTypeBooleanTrue:
			case dataTypeArray:
				d.FieldStructNArray("data", "element", int64(n/arrayEntrySize), func(d *decode.D) {
					offset := calcOffset(d.FieldU32("offset"))
					d.SeekAbs(offset, decodeRecord)
				})
			case dataTypeDictionary:
				d.FieldStructNArray("data", "element", int64(n/dictEntrySize), func(d *decode.D) {
					keyOffset := calcOffset(d.FieldU32("key_offset"))
					d.FieldStruct("key", func(d *decode.D) {
						d.SeekAbs(keyOffset, decodeRecord)
					})

					valueOffset := calcOffset(d.FieldU32("value_offset"))
					d.FieldStruct("value", func(d *decode.D) {
						d.SeekAbs(valueOffset, decodeRecord)
					})
				})
			case dataTypeUUID:
				d.FieldRawLen("data", int64(n*8))
			case dataTypeURL:
				d.FieldUTF8("data", n)
			case dataTypeRelativeURL:
				baseOffset := d.FieldU32("base_url_offset")
				d.FieldStruct("base_url", func(d *decode.D) {
					d.SeekAbs(int64(baseOffset), decodeRecord)
				})

				suffixOffset := d.FieldU32("suffix_offset")
				d.FieldStruct("suffix", func(d *decode.D) {
					d.SeekAbs(int64(suffixOffset), decodeRecord)
				})
			}
		})
	}
	return decodeRecord
}

// and then in decodeEntries do:
// d.SeekAbs(calcOffset(key&0x7fffffff), makeDecodeRecord())

Spent some time figure out a test case for this, maybe this in loop.book.fqtest

# loop.book created with:
# fq '[.bookmark_entries[0] | .offset_to_record, .record.data[0].offset | tobytesrange] as [$o0,$o1] | tobytes | [.[0:$o1.start],$o0,.[$o1.start+$o1.size:]] | tobytes' > loop.book 
$ fq -d apple_bookmark ._error.error loop.book
"error at position 0x6c: infinite record loop detected"

(i should really figure out nicer why to deal with errors hmm)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really cool, I was trying to figure out the cleanest way to add the stack to the scope of the decodeRecord function that didn't make a huge mess when passing parameters around, but I didn't know closures could be used like this!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For some reason it is not catching the infinite recursion in the example that I produced with your fq one-liner there. See the loop.book that I added to the testdata.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For some reason it is not catching the infinite recursion in the example that I produced with your fq one-liner there. See the loop.book that I added to the testdata.

Was a typo, see code comment

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really cool, I was trying to figure out the cleanest way to add the stack to the scope of the decodeRecord function that didn't make a huge mess when passing parameters around, but I didn't know closures could be used like this!

Yeap very useful pattern! go has quite good support for closures and lambdas, being GC:d and blurring what is stack and heap helps also. The syntax is a bit verbose, but not sure i wish for ml/haskell syntax either, a tiny bit more syntax help would be nice :)

format/applebookmark/apple_bookmark.go Outdated Show resolved Hide resolved
format/applebookmark/apple_bookmark.go Show resolved Hide resolved
doc/formats.md Show resolved Hide resolved
…ebookmark.fqtestdecode: converts applebookmark to use new d.PushAndPop method
@@ -30,6 +36,7 @@ References
==========
- https://developer.apple.com/documentation/foundation/url/2143023-bookmarkdata
- https://mac-alias.readthedocs.io/en/latest/bookmark_fmt.html
- https://www.mac4n6.com/blog/2016/1/1/manual-analysis-of-nskeyedarchiver-formatted-plist-files-a-review-of-the-new-os-x-1011-recent-items
-
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Guess this is line breaking code gone wrong when in lists? hmm maybe keep it like this and i will have look at is separately

}

var resourcePropDecoder = &dataObjectDecoder{decodeTgtPropertyFlagBits}
var volumePropDecoder = &dataObjectDecoder{decodeVolPropertyFlagBits}
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about something like:

diff --git a/format/applebookmark/apple_bookmark.go b/format/applebookmark/apple_bookmark.go
index 4716d4d0..d22bc83c 100644
--- a/format/applebookmark/apple_bookmark.go
+++ b/format/applebookmark/apple_bookmark.go
@@ -150,22 +150,12 @@ func decodeFlagDataObject(d *decode.D, flagFn func(d *decode.D)) {
 		d.FieldU32("length", d.AssertU(dataObjectLen))
 		d.FieldU32("raw_type", dataTypeMap, d.AssertU(dataTypeData))
 		d.FieldValueStr("type", "flag_data")
-		decodePropertyFlags(d, flagFn)
+		d.FieldStruct("property_flags", flagFn)
+		d.FieldStruct("enabled_property_flags", flagFn)
 		d.FieldRawLen("reserved", 64)
 	})
 }
 
-type dataObjectDecoder struct {
-	flagFn func(d *decode.D)
-}
-
-func (dod *dataObjectDecoder) decode(d *decode.D) {
-	decodeFlagDataObject(d, dod.flagFn)
-}
-
-var resourcePropDecoder = &dataObjectDecoder{decodeTgtPropertyFlagBits}
-var volumePropDecoder = &dataObjectDecoder{decodeVolPropertyFlagBits}
-
 func decodeTgtPropertyFlagBits(d *decode.D) {
 	start := d.Pos()
 	d.FieldBool("is_hidden")
@@ -251,12 +241,6 @@ func decodeVolPropertyFlagBits(d *decode.D) {
 	d.FieldBool("supports_volume_sizes")
 }
 
-func decodePropertyFlags(d *decode.D, bitFn func(d *decode.D)) {
-	d.FieldStruct("property_flags", bitFn)
-
-	d.FieldStruct("enabled_property_flags", bitFn)
-}
-
 var cocoaTimeEpochDate = time.Date(2001, time.January, 1, 0, 0, 0, 0, time.UTC)
 
 type tocHeader struct {
@@ -286,9 +270,9 @@ func (hdr *tocHeader) decodeEntries(d *decode.D) {
 
 			switch entry.key {
 			case elementTypeTargetFlags:
-				d.SeekAbs(entry.recordOffset, resourcePropDecoder.decode)
+				d.SeekAbs(entry.recordOffset, func(d *decode.D) { decodeFlagDataObject(d, decodeTgtPropertyFlagBits) })
 			case elementTypeVolumeFlags:
-				d.SeekAbs(entry.recordOffset, volumePropDecoder.decode)
+				d.SeekAbs(entry.recordOffset, func(d *decode.D) { decodeFlagDataObject(d, decodeVolPropertyFlagBits) })
 			default:
 				d.SeekAbs(entry.recordOffset, decodeRecord)
 			}

sometimes i wish go hade a little bit of more functional languages things like currying :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's probably better. I was trying to avoid the function declaration inside of the d.SeekAbs parameter list but it ended up creating a whole bunch of other code that isn't being reused. I haven't done much functional programming outside of general familiarity with first-class functions, but currying does seem like it would help here.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 I guess we could try use generics to make this pattern nicer. It seems to happen quite a lot, maybe something like d.SeekAbsFn1<T>(offset, some-T-value, fn(d *decode.D, v T) { ... }) could save on nested functions in the decoder code (and have SeekAbsFn2 if it takes two generic args)

})
}

entry.recordOffset = calcOffset(d.FieldU32("offset_to_record"))
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could both key and recordOffset this be a local variable instead? skip the tocEntry struct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we have to leave this as-is so that torepr will continue to work properly (it relies on the key translations in the elementTypeMap).

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aha i meant still have key and offset_to_record fields but do recordOffset := calcOffset(d.FieldU32("offset_to_record")) and above key := d.FieldU32("key", elementTypeMap) as they are not used outside each iteration?

@wader
Copy link
Owner

wader commented Dec 7, 2022

Sorry for all the late comments, hard to review by just reading so once you start poking around in an editor you notice things :)

But think we're close to merge now

defer pld.pushAndPop(
d.Pos(),
func() { d.Fatalf("infinite recursion detected in record decode function") },
)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo here, pop function is not defer called, add a (), after that the loop fqtest works for me 👍

@wader
Copy link
Owner

wader commented Dec 8, 2022

Any suggestions for a good workflow for when I need to regenerate fqtest output after decoder changes? It's a bit tedious to go through, delete the old stuff, and rerun them individually; I suspect you might have a better way to do that.

Did you get the WRITE_ACTUAL workflow to work?

Some other workflows i use:

Use watchexec etc so i don't have to switch terminals so much, ex run fq with some nice query in a separate terminal:

# -c clear screen
watchexec 'go run . torepr format/applebookmark/testdata/sample1.book'
watchexec -c 'go run . torepr format/applebookmark/testdata/sample1.book'
watchexec -c -- go run . torepr format/applebookmark/testdata/sample1.book

Use LOGFILE:

LOGFILE=/tmp/fq go run . -d blabla ...

Now log.Printf and friends output will be written to /tmp/fq, and then i usually run a separate terminal with tail -f /tmp/fq. This is very useful when the output from fq gets to mixed up.

It's usually also nice in general to spend some time to get a query that shows just the part your work on and then start to do changes.

@wader
Copy link
Owner

wader commented Dec 9, 2022

Ready to merge? only nit pick left was to possibly replace tocEntry with local vars in decodeEntries.

@dgmcdona
Copy link
Contributor Author

Yeah I think it's ready, just made the final edit you suggested

@dgmcdona
Copy link
Contributor Author

One last question though - I keep getting test failures locally for three formats, all for the same reason (differences in the way raw byte ranges are represented as strings). Something misconfigured locally maybe?

    --- FAIL: TestFormats/mp4/testdata/in24.fqtest (0.10s)
        fqtest.go:15:
            --- expected
            +++ actual
            @@ -111,7 +111,7 @@
             0x240|00 00 00 00                                    |....            |                  modification_time: 0 (1904-01-04T00:00:00Z) 0x240-0x243.7 (4)
             0x240|            00 00 ac 44                        |    ...D        |                  time_scale: 44100 0x244-0x247.7 (4)
             0x240|                        00 00 00 2c            |        ...,    |                  duration: 44 0x248-0x24b.7 (4)
            -0x240|                                    7f ff      |            ..  |                  language: "\x7f\x7f\x7f" 0x24c-0x24d.7 (2)
            +0x240|                                    7f ff      |            ..  |                  language: "\u007f\u007f\u007f" 0x24c-0x24d.7 (2)
             0x240|                                          00 00|              ..|                  quality: 0 0x24e-0x24f.7 (2)
                  |                                               |                |                [1]{}: box 0x250-0x27c.7 (45)
             0x250|00 00 00 2d                                    |...-            |                  size: 45 0x250-0x253.7 (4)

@wader wader merged commit 0b94269 into wader:master Dec 10, 2022
@wader
Copy link
Owner

wader commented Dec 10, 2022

🥳

@wader
Copy link
Owner

wader commented Dec 10, 2022

One last question though - I keep getting test failures locally for three formats, all for the same reason (differences in the way raw byte ranges are represented as strings). Something misconfigured locally maybe?

Hmm i recognize this issue. Have a vague memory there was a string escape different between go 1.18 and 1.19? could that be it? if so try to update, but we can probably add some workaround also

@dgmcdona
Copy link
Contributor Author

Upgraded to 1.19 and that fixed it, thx

@wader
Copy link
Owner

wader commented Dec 10, 2022

Great, but maybe a workaround is worth it, official go 1.18 support will be dropped in 4 month i think but i suspect that new versions of fq will be able to build with it for a long time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants