
Decrease Out Of Memory chances by scanning 250 commits at a time #270

Closed
wants to merge 1 commit

Conversation

@dcRUSTy (Collaborator) commented on Oct 25, 2020

Previously, additions from all commits were read at once and then scanned, which may cause OOM when a repo is huge. Now the additions of 250 commits at a time are read and then scanned, and so on.


Thoughts - What should be the ideal number of commits to scan at a time?
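For illustration, a rough sketch of the batching idea (hypothetical names, not talisman's actual API; readAdditions and scanAdditions stand in for the real read and scan steps):

```go
package scanner

// addition is a placeholder for one file addition pulled out of a commit.
type addition struct {
	path    string
	content []byte
}

const batchSize = 250

// scanInBatches reads and scans additions 250 commits at a time, so only one
// batch's worth of blob content has to be held in memory at once.
func scanInBatches(commits []string, readAdditions func([]string) []addition, scanAdditions func([]addition) error) error {
	for start := 0; start < len(commits); start += batchSize {
		end := start + batchSize
		if end > len(commits) {
			end = len(commits)
		}
		if err := scanAdditions(readAdditions(commits[start:end])); err != nil {
			return err
		}
	}
	return nil
}
```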

@dcRUSTy marked this pull request as draft on October 25, 2020 10:08
@svishwanath-tw (Collaborator) left a comment

It would be better to scan a number of files at a time instead of a number of commits.

log.Fatal(err)
}
result := strings.Split(string(out), "\n")[0] // first line of the command output
count, _ := strconv.ParseUint(result, 10, 64) // parse error is discarded here
Why ignore the error ?
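One way to surface that error rather than discarding it (a sketch only; it assumes out holds the raw output of the commit-count command and that log, strconv and strings are already imported in this file):

```go
result := strings.Split(string(out), "\n")[0]
count, err := strconv.ParseUint(result, 10, 64)
if err != nil {
	// A malformed count should stop the scan instead of silently becoming 0.
	log.Fatalf("failed to parse commit count %q: %v", result, err)
}
_ = count
```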

@dcRUSTy (Collaborator, Author) commented on Oct 28, 2020

Read -> Scan -> Read -> Scan
will have some latency issues...
The ideal solution would be some multithreaded pipeline/buffer solution with an upper limit on buffer size based on available RAM... but it would make things more complex 😅
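For reference, a bounded pipeline along those lines could look roughly like this (hypothetical sketch; batch, readBatches and scanBatch are invented names, and the buffered channel is the upper limit on how many batches sit in memory while the scanner catches up):

```go
package scanner

// batch is a placeholder for one chunk of additions read from the repo.
type batch struct {
	additions [][]byte
}

// scanPipelined reads batches on one goroutine and scans them on another.
// The channel's buffer size caps the number of in-flight batches.
func scanPipelined(readBatches func(chan<- batch), scanBatch func(batch) error, maxBuffered int) error {
	batches := make(chan batch, maxBuffered)

	go func() {
		defer close(batches)
		readBatches(batches) // blocks whenever maxBuffered batches are already queued
	}()

	for b := range batches {
		if err := scanBatch(b); err != nil {
			// Drain the channel so the reader goroutine can finish and exit.
			for range batches {
			}
			return err
		}
	}
	return nil
}
```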

@teleivo (Contributor) commented on Apr 22, 2021

Hi, I think this is an interesting approach! And thank you for working on this performance issue 😄 I am having trouble finishing scans of any "larger" repo on my machine, even repos that are not considered outrageous by https://github.com/github/git-sizer. The reasons for that might be more than just memory.

As I only recently started looking into the talisman code, my thoughts/questions might be naive, so forgive me 😄
My current understanding of a talisman --scan is that the scanner reads the contents of every reachable blob into memory, which I assume is what this PR is about.

  • Has it been considered to read the blob content into memory lazily, i.e. only just before running the detector tests? (A rough sketch of what that could look like follows this list.)
  • The scanner currently also reads the blob contents of files that are ignored in the .talismanrc into memory. This might be avoidable. I am not sure whether it would be feasible to reuse the git SHA as the checksum: same blob content means same git SHA. I am also not sure how much of a performance gain that would be. How many files do people ignore, how often are they changed afterwards, ... But avoiding the calculation of a separate SHA-256 checksum might be a gain. I do not know the design decisions behind this, so there are probably legitimate reasons for not reusing the git SHA-1.
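To make the first point concrete, a rough sketch of lazy blob loading (hypothetical types, not talisman's real code; it shells out to git cat-file only when a detector actually asks for the content):

```go
package scanner

import "os/exec"

// lazyBlob holds only the blob's git SHA until the content is needed.
type lazyBlob struct {
	repoPath string
	sha      string
	content  []byte
}

// Content runs `git cat-file blob <sha>` on first use and caches the result,
// so blobs that are never inspected (e.g. ignored files) are never read.
func (b *lazyBlob) Content() ([]byte, error) {
	if b.content != nil {
		return b.content, nil
	}
	cmd := exec.Command("git", "cat-file", "blob", b.sha)
	cmd.Dir = b.repoPath
	out, err := cmd.Output()
	if err != nil {
		return nil, err
	}
	b.content = out
	return out, nil
}
```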

I am playing around with pprof, tracing, and git-sizer to better understand the code and where its current bottlenecks are. Once I have a clearer picture, I can post some results. Please point me to your analysis if you already have results from pprof or similar.

Happy to discuss this further :)

@dcRUSTy (Collaborator, Author) commented on Apr 22, 2021

@teleivo
This was inspired by #181 and #208. I think the current approach is that talisman reads (copies to RAM) everything back to the initial commit and then scans it (I might be wrong, it has been a long time since I last read the code).

Copying the full git repo contents to memory is what I thought was causing those issues.

@svishwanath-tw (Collaborator) commented
@dcRUSTy: I've just completed a long-pending release. Please close this PR or make it ready for merge; this feature is sought after.
