
Decrease Out Of Memory chances by scanning 250 commits at a time #270

Closed
wants to merge 1 commit

Conversation

@dcRUSTy (Collaborator) commented on Oct 25, 2020

Previously, additions from all commits were read at once and then scanned, which may cause OOM when a repo is huge. Now the additions of 250 commits at a time are read and then scanned, and so on.


Thoughts - What should be the ideal number of commits to scan at a time?
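For illustration, a rough sketch of the batching idea (hypothetical names, not talisman's actual API; readAdditions and scanAdditions stand in for the real read and scan steps):

```go
package scanner

// addition is a placeholder for one file addition pulled out of a commit.
type addition struct {
	path    string
	content []byte
}

const batchSize = 250

// scanInBatches reads and scans additions 250 commits at a time, so only one
// batch's worth of blob content has to be held in memory at once.
func scanInBatches(commits []string, readAdditions func([]string) []addition, scanAdditions func([]addition) error) error {
	for start := 0; start < len(commits); start += batchSize {
		end := start + batchSize
		if end > len(commits) {
			end = len(commits)
		}
		if err := scanAdditions(readAdditions(commits[start:end])); err != nil {
			return err
		}
	}
	return nil
}
```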

@dcRUSTy marked this pull request as draft on October 25, 2020 10:08
@svishwanath-tw (Collaborator) left a comment

It would be better to scan a number of files at a time instead of a number of commits.

log.Fatal(err)
}
result := strings.Split(string(out), "\n")[0] // first line of the command output
count, _ := strconv.ParseUint(result, 10, 64) // parse error is discarded here
Why ignore the error ?
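One way to surface that error rather than discarding it (a sketch only; it assumes out holds the raw output of the commit-count command and that log, strconv and strings are already imported in this file):

```go
result := strings.Split(string(out), "\n")[0]
count, err := strconv.ParseUint(result, 10, 64)
if err != nil {
	// A malformed count should stop the scan instead of silently becoming 0.
	log.Fatalf("failed to parse commit count %q: %v", result, err)
}
_ = count
```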

@dcRUSTy (Collaborator, Author) commented on Oct 28, 2020

Read -> Scan -> Read -> Scan
will have some latency issues...
The ideal solution would be some multithreaded pipeline/buffer solution with an upper limit on buffer size based on available RAM... but it would make things more complex 😅
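For reference, a bounded pipeline along those lines could look roughly like this (hypothetical sketch; batch, readBatches and scanBatch are invented names, and the buffered channel is the upper limit on how many batches sit in memory while the scanner catches up):

```go
package scanner

// batch is a placeholder for one chunk of additions read from the repo.
type batch struct {
	additions [][]byte
}

// scanPipelined reads batches on one goroutine and scans them on another.
// The channel's buffer size caps the number of in-flight batches.
func scanPipelined(readBatches func(chan<- batch), scanBatch func(batch) error, maxBuffered int) error {
	batches := make(chan batch, maxBuffered)

	go func() {
		defer close(batches)
		readBatches(batches) // blocks whenever maxBuffered batches are already queued
	}()

	for b := range batches {
		if err := scanBatch(b); err != nil {
			// Drain the channel so the reader goroutine can finish and exit.
			for range batches {
			}
			return err
		}
	}
	return nil
}
```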

@teleivo (Contributor) commented on Apr 22, 2021

Hi, I think this is an interesting approach! And thank you for working on this performance issue 😄 I am having trouble finishing scans of any "larger" repo on my machine, even repos that are not considered outrageous by https://github.com/github/git-sizer. The reasons for that might be more than just memory.

As I only recently started looking into the talisman code, my thoughts/questions might be naive, so forgive me 😄
My current understanding of a talisman --scan is that the scanner reads the contents of every reachable blob into memory, which I assume is what this PR is about.

  • Has it been considered to read the blob content into memory lazily, i.e. only just before running the detector tests? (A rough sketch of what that could look like follows this list.)
  • The scanner currently also reads the blob contents of files that are ignored in the .talismanrc into memory. This might be avoidable. I am not sure whether it would be feasible to reuse the git SHA as the checksum: same blob content means same git SHA. I am also not sure how much of a performance gain that would be. How many files do people ignore, how often are they changed afterwards, ... But avoiding the calculation of a separate SHA-256 checksum might be a gain. I do not know the design decisions behind this, so there are probably legitimate reasons for not reusing the git SHA-1.
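To make the first point concrete, a rough sketch of lazy blob loading (hypothetical types, not talisman's real code; it shells out to git cat-file only when a detector actually asks for the content):

```go
package scanner

import "os/exec"

// lazyBlob holds only the blob's git SHA until the content is needed.
type lazyBlob struct {
	repoPath string
	sha      string
	content  []byte
}

// Content runs `git cat-file blob <sha>` on first use and caches the result,
// so blobs that are never inspected (e.g. ignored files) are never read.
func (b *lazyBlob) Content() ([]byte, error) {
	if b.content != nil {
		return b.content, nil
	}
	cmd := exec.Command("git", "cat-file", "blob", b.sha)
	cmd.Dir = b.repoPath
	out, err := cmd.Output()
	if err != nil {
		return nil, err
	}
	b.content = out
	return out, nil
}
```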

I am playing around with pprof, tracing, and git-sizer to better understand the code and where its current bottlenecks are. Once I have a clearer picture, I can post some results. Please point me to your analysis if you already have results from pprof or similar.

Happy to discuss this further :)

@dcRUSTy (Collaborator, Author) commented on Apr 22, 2021

@teleivo
This was inspired by #181 and #208. I think the current approach is that talisman reads (copies to RAM) everything back to the initial commit and then scans it (I might be wrong, it has been a long time since I last read the code).

Copying the full git repo contents to memory is what I thought was causing those issues.

@svishwanath-tw (Collaborator) commented
@dcRUSTy: I've just completed a long-pending release. Please close this PR or make it ready for merge; this feature is sought after.
