Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve performance - Avoid repeated git diff calls #154

Conversation

aswinkarthik
Copy link
Contributor

@aswinkarthik aswinkarthik commented Oct 14, 2019

Context

I was measuring why talisman took lot of time when there are lot of git additions involved. I was trying to seggregate each call and the time it took.

As a test, assume we git add the vendor/ directory of talisman (~400 additions)

These were the main observation:

GetDiffForStagedFiles: git repo additions took 6.198876156s
detector.FileNameDetector took 51.372538ms
*detector.FileContentDetector took 242.377893ms
*detector.PatternDetector took 1.894126904s

real    0m8.459s
user    0m13.300s
sys     0m2.266s

The command took 8 seconds inspite of changing detectors to use channels. The main culprit was GetDiffForStagedFiles. Looking into it, I saw that there were 400 git diff --staged commands being made for each single file.

Solution

  • Remove one call per file.
  • Do one git diff --staged which gives the complete output of all git diff --staged {FILENAME} combined together.
  • Git diff follows the following format mentioned here
diff --git a/{filename} b/{filename}
...
...
+...
-...
diff --git a/{filename_2} b/{filename_2}
...

This is structured format and each file's diff starts with that header. So the approach,

  • Looks for the header and get filename
  • Accumulates the diff content for the given filename till the next header comes.
  • When next header comes, send the previous file's git diff --staged content to fetchStagedDiff which does a few trimming and adds the staged addition (now renamed to extractAdditions).
  • Continues till end of output

Performance post that:

GetDiffForStagedFiles: git repo additions took 448.128554ms
detector.FileNameDetector took 67.564441ms
*detector.FileContentDetector took 246.185828ms
*detector.PatternDetector took 1.867180727s

real    0m2.707s
user    0m11.829s
sys     0m0.219s

Fetching git diff is 15X faster.
Talisman is ~4X faster now.

Questions:

  1. What do you think about this approach?
  2. I did not add new functionality and hence did not add tests. Should I add new tests for this? gitrepo_test passes. I slightly modified to not panic on assertion failures.

For future scope, all pattern matching can be improved by not looping over all patterns and trying to match through a single pass. But that can be made as a separate PR (which can reduce PatternDetector's time taken, second highest time taken, further)

@aswinkarthik
Copy link
Contributor Author

Addresses #152

@svishwanath-tw
Copy link
Collaborator

@aswinkarthik Excellent work on the PR!!

@svishwanath-tw svishwanath-tw merged commit 44eef9f into thoughtworks:master Oct 17, 2019
@aswinkarthik
Copy link
Contributor Author

Thank you @svishwanath-tw

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants