Improve performance - Avoid repeated git diff calls #154

aswinkarthik · 2019-10-14T18:38:28Z

Context

I was measuring why talisman takes a lot of time when a lot of git additions are involved. I tried to segregate each call and measure the time it took.

As a test, assume we git add the vendor/ directory of talisman (~400 additions)

These were the main observations:

GetDiffForStagedFiles: git repo additions took 6.198876156s
detector.FileNameDetector took 51.372538ms
*detector.FileContentDetector took 242.377893ms
*detector.PatternDetector took 1.894126904s

real    0m8.459s
user    0m13.300s
sys     0m2.266s

The command took 8 seconds in spite of changing the detectors to use channels. The main culprit was GetDiffForStagedFiles. Looking into it, I saw that 400 git diff --staged commands were being run, one for each file.

Solution

Remove the one-call-per-file approach.
Run a single git diff --staged, which gives the combined output of every git diff --staged {FILENAME} call.
The git diff output follows this format:

diff --git a/{filename} b/{filename}
...
...
+...
-...
diff --git a/{filename_2} b/{filename_2}
...

This is a structured format, and each file's diff starts with that header. So the approach:

Looks for the header and gets the filename.
Accumulates the diff content for that filename until the next header comes.
When the next header comes, sends the previous file's git diff --staged content to fetchStagedDiff, which does some trimming and adds the staged addition (now renamed to extractAdditions).
Continues until the end of the output.
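To make the steps above concrete, here is a minimal sketch in Go of splitting one combined diff into per-file chunks. It is not the code from this PR: `fileDiff`, `headerFilename`, `splitDiff` and `addedLines` are hypothetical names chosen for illustration, and the sketch assumes headers of the plain `diff --git a/X b/X` form (no renames, no quoted paths).

```go
package main

import (
	"fmt"
	"strings"
)

// fileDiff pairs a file name with the raw diff text that follows its header.
// Hypothetical type, not taken from the talisman code base.
type fileDiff struct {
	Name string
	Body string
}

const diffHeaderPrefix = "diff --git a/"

// headerFilename extracts X from a "diff --git a/X b/X" line.
// Assumption: both sides name the same path, so the remainder is X + " b/" + X.
func headerFilename(line string) (string, bool) {
	if !strings.HasPrefix(line, diffHeaderPrefix) {
		return "", false
	}
	rest := strings.TrimPrefix(line, diffHeaderPrefix)
	if len(rest) < 5 || (len(rest)-3)%2 != 0 {
		return "", false
	}
	n := (len(rest) - 3) / 2
	if rest[n:n+3] != " b/" || rest[:n] != rest[n+3:] {
		return "", false
	}
	return rest[:n], true
}

// splitDiff walks the combined output once, starting a new chunk at every
// header and flushing the previous chunk when the next header (or EOF) arrives.
func splitDiff(output string) []fileDiff {
	var diffs []fileDiff
	var body strings.Builder
	current, inFile := "", false

	flush := func() {
		if inFile {
			diffs = append(diffs, fileDiff{Name: current, Body: body.String()})
		}
		body.Reset()
	}

	for _, line := range strings.Split(strings.TrimSuffix(output, "\n"), "\n") {
		if name, ok := headerFilename(line); ok {
			flush()
			current, inFile = name, true
			continue
		}
		if inFile {
			body.WriteString(line)
			body.WriteByte('\n')
		}
	}
	flush()
	return diffs
}

// addedLines returns the "+" lines that appear inside hunks, skipping the
// "+++ b/..." file header that precedes the first "@@" marker.
func addedLines(body string) []string {
	var added []string
	inHunk := false
	for _, line := range strings.Split(body, "\n") {
		switch {
		case strings.HasPrefix(line, "@@"):
			inHunk = true
		case inHunk && strings.HasPrefix(line, "+"):
			added = append(added, line[1:])
		}
	}
	return added
}

func main() {
	sample := "diff --git a/a.txt b/a.txt\n" +
		"--- /dev/null\n" +
		"+++ b/a.txt\n" +
		"@@ -0,0 +1,2 @@\n" +
		"+hello\n" +
		"+world\n" +
		"diff --git a/dir/b.txt b/dir/b.txt\n" +
		"@@ -1 +1 @@\n" +
		"-old\n" +
		"+new\n"
	for _, d := range splitDiff(sample) {
		fmt.Printf("%s: %q\n", d.Name, addedLines(d.Body))
	}
}
```

Content lines inside a diff always begin with `+`, `-` or a space, so a line that starts with `diff --git a/` can be treated as a header without looking at the surrounding context.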

Performance after the change:

GetDiffForStagedFiles: git repo additions took 448.128554ms
detector.FileNameDetector took 67.564441ms
*detector.FileContentDetector took 246.185828ms
*detector.PatternDetector took 1.867180727s

real    0m2.707s
user    0m11.829s
sys     0m0.219s

Fetching the git diff is ~14X faster (6.2s down to 0.45s).
Talisman is ~3X faster overall (8.5s down to 2.7s).

Questions:

What do you think about this approach?
I did not add new functionality and hence did not add tests. Should I add new tests for this? gitrepo_test passes. I slightly modified it to not panic on assertion failures.

For future scope, pattern matching can be improved by matching all patterns in a single pass instead of looping over each pattern. That can be done in a separate PR, and it would further reduce the time taken by PatternDetector, which is the second-highest cost.
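One possible reading of the single-pass idea, sketched in Go: join the individual patterns into one alternation and scan the text once with the combined regexp. This is only an illustration under assumptions. `combinePatterns` and the two sample patterns are hypothetical and not taken from talisman, the combined regexp reports non-overlapping leftmost matches rather than which pattern fired, and whether it actually beats the per-pattern loop would need measuring.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// combinePatterns wraps each pattern in a non-capturing group and joins them
// with "|" so that a single regexp covers all of them.
// Each pattern is compiled on its own first so errors point at the culprit.
func combinePatterns(patterns []string) (*regexp.Regexp, error) {
	if len(patterns) == 0 {
		return nil, fmt.Errorf("no patterns given")
	}
	grouped := make([]string, len(patterns))
	for i, p := range patterns {
		if _, err := regexp.Compile(p); err != nil {
			return nil, fmt.Errorf("pattern %d: %w", i, err)
		}
		grouped[i] = "(?:" + p + ")"
	}
	return regexp.Compile(strings.Join(grouped, "|"))
}

func main() {
	// Hypothetical patterns, for illustration only.
	patterns := []string{
		`(?i)password\s*=\s*\S+`,
		`AKIA[0-9A-Z]{16}`,
	}
	re, err := combinePatterns(patterns)
	if err != nil {
		panic(err)
	}
	text := "user=x\nPassword = hunter2\nkey AKIAABCDEFGHIJKLMNOP end"
	// One call scans the text once for every pattern.
	for _, m := range re.FindAllString(text, -1) {
		fmt.Println(m)
	}
}
```

Flags such as `(?i)` stay scoped to their own group, so one case-insensitive pattern does not change how the others match.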

aswinkarthik · 2019-10-14T18:46:13Z

Addresses #152

gitrepo/gitrepo.go

svishwanath-tw · 2019-10-17T08:56:20Z

@aswinkarthik Excellent work on the PR!!

aswinkarthik · 2019-10-17T15:53:17Z

Thank you @svishwanath-tw

Avoid multiple git diff staged commands and parse output from single command (1147d8d)

svishwanath-tw reviewed Oct 17, 2019

svishwanath-tw merged commit 44eef9f into thoughtworks:master Oct 17, 2019
