Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update gitparse logic #1486

Merged
merged 1 commit into from
Jul 25, 2023
Merged

Conversation

rgmz
Copy link
Contributor

@rgmz rgmz commented Jul 12, 2023

This fixes #1457.

It's a first-pass rewrite of gitparse.go and could undoubtedly use some changes. Feedback is welcome.

At the very least, it needs better error-handling for unhandled edge cases. I plan to run trufflehog against different large repositories to see if there are any issues.

This should correctly parse all of the following scenarios. For the sake of keeping these diagrams simple and legible, "end" represents either a new commit or diff, or the end of the input.

---
title: Git history parsing flow
---
stateDiagram-v2
    [*] --> CommitLine

    CommitLine --> AuthorLine
    AuthorLine --> DateLine
    DateLine --> MessageStartLine
    DateLine --> [*]: Empty message and commit

    MessageStartLine --> DiffLine: Empty message
    MessageStartLine --> [*]: Empty message and commit
    MessageStartLine --> MessageLine
    MessageLine --> MessageLine
    MessageLine --> MessageEndLine
    MessageEndLine --> DiffLine
    MessageEndLine --> [*]: Empty commit

%% If modes are present
    DiffLine --> ModeLine
    ModeLine --> ModeLine
    ModeLine --> IndexLine
    ModeLine --> [*]: No content changes
%% otherwise
    DiffLine --> IndexLine

%% If it's a binary diff
    IndexLine --> BinaryFileLine
%% If there are multiple diffs
    BinaryFileLine --> DiffLine
    BinaryFileLine --> [*]

%% If it's normal text content
    IndexLine --> FromFileLine
    FromFileLine --> ToFileLine
    ToFileLine --> HunkLineNumber
    ToFileLine --> [*]: Empty file
    HunkLineNumber --> HunkContentLine
    HunkContentLine --> HunkLineNumber 
    HunkContentLine --> HunkContentLine

    HunkContentLine --> DiffLine
    HunkContentLine --> [*]
Loading
---
title: Git staged diff parsing flow
---
stateDiagram-v2
    [*] --> DiffLine

%% If modes are present
    DiffLine --> ModeLine
    ModeLine --> ModeLine
    ModeLine --> IndexLine
    ModeLine --> [*]: No content changes
%% otherwise
    DiffLine --> IndexLine

%% If it's a binary diff
    IndexLine --> BinaryFileLine
    BinaryFileLine --> [*]

%% If it's normal text content
    IndexLine --> FromFileLine
    FromFileLine --> ToFileLine
    ToFileLine --> HunkLineNumber
    ToFileLine --> [*]: Empty file
    HunkLineNumber --> HunkContentLine
    HunkContentLine --> HunkLineNumber 
    HunkContentLine --> HunkContentLine

    HunkContentLine --> [*]
Loading

To-dos:

/cc @zricethezav

@rgmz rgmz requested a review from a team as a code owner July 12, 2023 02:37
@rgmz rgmz force-pushed the feat/git-diff-parse branch 3 times, most recently from e46ff90 to a169ddc Compare July 12, 2023 03:08
@rgmz rgmz marked this pull request as draft July 12, 2023 13:53
@dustin-decker
Copy link
Contributor

Thanks for taking this on @rgmz! Looping in @bill-rich to have a look as well.

@rgmz rgmz force-pushed the feat/git-diff-parse branch 3 times, most recently from e55a5fe to 761b111 Compare July 12, 2023 18:41
@zricethezav
Copy link
Collaborator

@rgmz thanks for taking a crack at this issue! At a glance this looks like some high quality stuff. I'll put some eyes on it this afternoon.

At the very least, it needs better error-handling for unhandled edge cases. I plan to run trufflehog against different large repositories to see if there are any issues.

Thanks for being so thorough with your testing! Would you mind letting us know your results? Also feel free to ping us when you remove the draft label. In the meantime I'll start reviewing some of this

@bill-rich
Copy link
Collaborator

This is great! The context awareness for each line looks like it'll solve most of the hacky bits used before.

@rgmz
Copy link
Contributor Author

rgmz commented Jul 12, 2023

Would you mind letting us know your results?

I ran this against all the repositories listed in the description and none of them panicked, which would suggest that the logic covers many possibilities. I wouldn't be surprised if this still has some edge cases, however, I haven't found any yet.

default:
if currentCommit != nil {
fmt.Printf("Failed to parse diff (%s).\n\tLatest state (%s) did not match line (%s)\n", currentCommit.Hash, latestState, line)
} else {
fmt.Printf("Failed to parse diff.\n\tLatest state (%s) did not match line (%s)\n", latestState, line)
}
panic("Failed to parse diff (%s).")

This is great! The context awareness for each line looks like it'll solve most of the hacky bits used before.

Tracking prev, current, and next lines would further tighten up the parsing logic. I would consider this as more correct rather than perfect. 😅

@rgmz rgmz marked this pull request as ready for review July 12, 2023 23:06
@zricethezav
Copy link
Collaborator

@rgmz amazing work Richard! I just have one comment that is worth addressing, otherwise I think this PR is good to go. Thanks for adding thorough unit tests and reporting your integration test. Super high quality PR 🙇🏻 🥳 🎈

@rgmz rgmz force-pushed the feat/git-diff-parse branch 3 times, most recently from 6681fc5 to a0dfea1 Compare July 20, 2023 16:27
@rgmz
Copy link
Contributor Author

rgmz commented Jul 20, 2023

I replaced the panic with some basic logging + recovery logic.

However, TestMaxDiffSize now seems to be failing intermittently and I am not sure if that was caused by these changes.

=== RUN   TestMaxDiffSize
runtime: pointer 0xc0005bfcf0 to unallocated span span.base()=0xc000580000 span.limit=0xc0005c0000 span.state=0
runtime: found in object at *(0x11fd780+0x15050)
object=0x11fd780 s=nil
fatal error: found bad pointer in Go heap (incorrect use of unsafe or cgo?)

runtime stack:
runtime.throw({0xc410cc?, 0x6?})
	/home/user/sdk/go1.20.6/src/runtime/panic.go:1047 +0x5d fp=0x7fb907ffeb50 sp=0x7fb907ffeb20 pc=0x43919d
runtime.badPointer(0x7fb92522d198, 0xc000101000?, 0x11fd780, 0x6?)
	/home/user/sdk/go1.20.6/src/runtime/mbitmap.go:314 +0x150 fp=0x7fb907ffeba0 sp=0x7fb907ffeb50 pc=0x416ab0
runtime.findObject(0xc0002398c0?, 0x1f17?, 0x2fb29d66?)
	/home/user/sdk/go1.20.6/src/runtime/mbitmap.go:357 +0xa6 fp=0x7fb907ffebd8 sp=0x7fb907ffeba0 pc=0x416c46
runtime.scanblock(0x11fd780, 0x19230, 0x7fb94ef5eda8, 0x0?, 0x0)
	/home/user/sdk/go1.20.6/src/runtime/mgcmark.go:1237 +0xb8 fp=0x7fb907ffec40 sp=0x7fb907ffebd8 pc=0x421618
runtime.markrootBlock(0xc70450?, 0x7fb907ffecb0?, 0x40e109?, 0xc0005c0148?, 0x0?)
	/home/user/sdk/go1.20.6/src/runtime/mgcmark.go:284 +0x65 fp=0x7fb907ffec80 sp=0x7fb907ffec40 pc=0x41f5e5
runtime.markroot(0xc000051738, 0x2, 0x1)
	/home/user/sdk/go1.20.6/src/runtime/mgcmark.go:170 +0x3d1 fp=0x7fb907ffed20 sp=0x7fb907ffec80 pc=0x41f3d1
runtime.gcDrain(0xc000051738, 0x3)
	/home/user/sdk/go1.20.6/src/runtime/mgcmark.go:1069 +0x39f fp=0x7fb907ffed80 sp=0x7fb907ffed20 pc=0x42129f
runtime.gcBgMarkWorker.func2()
	/home/user/sdk/go1.20.6/src/runtime/mgc.go:1348 +0xad fp=0x7fb907ffedd0 sp=0x7fb907ffed80 pc=0x41d7ad
runtime.systemstack()
	/home/user/sdk/go1.20.6/src/runtime/asm_amd64.s:496 +0x49 fp=0x7fb907ffedd8 sp=0x7fb907ffedd0 pc=0x46d1e9

goroutine 11 [GC worker (active)]:
runtime.systemstack_switch()
	/home/user/sdk/go1.20.6/src/runtime/asm_amd64.s:463 fp=0xc000505750 sp=0xc000505748 pc=0x46d180
runtime.gcBgMarkWorker()
	/home/user/sdk/go1.20.6/src/runtime/mgc.go:1335 +0x205 fp=0xc0005057e0 sp=0xc000505750 pc=0x41d445
runtime.goexit()
	/home/user/sdk/go1.20.6/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc0005057e8 sp=0xc0005057e0 pc=0x46f3a1
created by runtime.gcBgMarkStartWorkers
	/home/user/sdk/go1.20.6/src/runtime/mgc.go:1199 +0x25

@bill-rich
Copy link
Collaborator

Did some tests against large repos and the findings match the existing version. Performance also seems on par, so things look good on both of those fronts.

@zricethezav
Copy link
Collaborator

I replaced the panic with some basic logging + recovery logic.

@rgmz Woof, that panic looks gnarly and hard to gain anything from that trace since it looks like it's cgo. This just started popping up after the error logging?

@rgmz

This comment was marked as resolved.

@rgmz
Copy link
Contributor Author

rgmz commented Jul 22, 2023

It seems that Go really dislikes expectedCommits being top-level. Moving it into TestCommitParsing and TestIndividualCommitParsing seems to fix the panic.

Give it a try locally and let me know if you can reproduce this.

$ go test -count=1 -timeout 30s -run ^TestMaxDiffSize$ github.com/trufflesecurity/trufflehog/v3/pkg/gitparse

@rgmz rgmz force-pushed the feat/git-diff-parse branch 3 times, most recently from 2a983ff to 7e6c850 Compare July 22, 2023 15:32
@rgmz rgmz force-pushed the feat/git-diff-parse branch 2 times, most recently from 9588405 to 317313e Compare July 24, 2023 14:56
@zricethezav
Copy link
Collaborator

Hey @rgmz

It seems that Go really dislikes expectedCommits being top-level. Moving it into TestCommitParsing and TestIndividualCommitParsing seems to fix the panic.
Give it a try locally and let me know if you can reproduce this.

I'm not seeing any panics! Nice job. Top-level variables used in test files sound unstable-ish so I'm glad you figured out a fix.

Copy link
Collaborator

@zricethezav zricethezav left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work @rgmz!!!

@zricethezav zricethezav merged commit f48a635 into trufflesecurity:main Jul 25, 2023
9 of 10 checks passed
@rgmz rgmz deleted the feat/git-diff-parse branch July 25, 2023 23:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

Git diff starting with ++ results in an incorrect filename
4 participants