-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Vastly improve performance #157
Conversation
This is awesome! I'll review more when I get a chance, but just some quick responses:
I'd like to not support Note to self: submodules should be detected as vendored by default. |
Great, take your time! I will fix CI and undo the change to attributes. Also I plan to make a new release of When looking at linguist and remembering that it took 5min on a repo, I couldn't stop to wonder if they run this at GitHub scale. If so, how much time must be wasted 🥺. |
Doing this avoids touching the disk.
CI should be fixed now, and the |
Performance is probably much better on GitHub's hardware than my own limited VM 😆 but yeah, it does raise that question. And Linguist has had some issues -- recently, one of its heuristics was a ReDOS that stopped some files from rendering or getting highlighted. Ironically, that ReDOS was introduced in an attempt to get There's also the question of what exactly causes the time spent for the Linux repo. Is it simply the file count, or are there file(s) that severely hit their Bayesian classifier, or something else? Usage as a library might be more performant as a binary, too. If the classifier is set up for each time you run the binary, but only once for the library, for example. I guess it's time to admit my hopeful thinking: I know that GitHub switched their code search to a Rust implementation for more performance. Could there perhaps be a Rust project for GitHub's language detection? 🤔 😆 |
BTW don't worry about the failing tests for now. I'll update the |
Codecov Report
@@ Coverage Diff @@
## main #157 +/- ##
==========================================
- Coverage 83.65% 82.33% -1.33%
==========================================
Files 14 14
Lines 514 566 +52
==========================================
+ Hits 430 466 +36
- Misses 84 100 +16
Flags with carried forward coverage won't be shown. Click here to find out more.
📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
I actually had a feeling :D! But then you said there are differences and incompatibilities which makes that less likely to happen, so they would probably rather do their own RiR. Time will tell what's going to happen though :). Do you have an idea why people would add |
One of the most common is probably People use "prose" and "data" are excluded by default from language statistics, so sometimes you want to use One of the intentional divergences from linguist is how these attributes behave: the linguist maintainers will admit that these attributes can be a bit confusing. So |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks again!
I intend to make a new release once spenserblack#157 is merged.
Thanks for the review. |
Yup! I wasn't that clear, but some of these are to dos for myself. I'd appreciate if quickly skimmed over my style changes/refactoring later, though, just to confirm the changes are purely code style, not affecting behavior. |
The `max-performance` and `max-performance-safe` gix features have been re-exported so that consumers can enable/disable them without importing a matching version of `gix`.
🤢 Per-OS snapshot tests certainly have their downsides 😆 I should've learned from repofetch. @Byron as the gix expert, do you have any recommendations on detecting files from submodules as vendored? My lazy self was thinking of just adding a It would probably be cleanest to implement this in the |
🤔 Now that I think about it, while traversing submodules is really cool, I wonder if it should be dropped. I'd want to mark submodules as vendored, but that is very specific to git, and could block, or at least make much more difficult to implement, other file sources. And submodules are technically other repos, and usually not code that belongs to the main project... |
I thought the same - with that flag it should be possible to implement any logic. And yes, it's
Not traversing submodules is faster, of course, and maybe for that reason alone it could be a flag to be configured. However, I think it's nothing to skip as, at least I, am interesting how much of the project is vendored. Maybe that's entirely the wrong thing to do though, and it's better to look at how |
I thought about this again and definitely think that submodules should not be ignored. However, it's the question what to do with them by default or… how to allow users to affect what they are considered. There is no problem in assigning attributes to submodules, and I think this could be used to affect if they should be vendored or not. This reminds me, maybe one should check for |
How would # /.gitattributes
A/* eol=lf Anyway, for now I'm really excited about this PR, so I'd be happy to merge this and continue the discussion on how to handle submodules later. So perhaps just an |
OK, thinking about it a bit more, it might be nice for |
Actually it would - it's perfectly viable to use complete paths of submodule files (as relative to the superproject root) for an attribute query in the superproject. But that's not what I meant - I merely meant to check the path of the
That would be enough to declare all submodule files vendored, I think.
Maybe that doesn't matter (yet) for merging the PR like you mentioned. However, I don't think submodules should always be vendored, there should be en escape hatch, that just treats them like ordinary code that is part of the tree (like it's done now). Then tying submodules into vendor-handling too much seems counter-intuitive. But again, in the end it doesn't matter as long as the code does what it should and you define what that is. |
Going to merge as-is, we can handle submodules in a separate PR. I'll see if I can get something set up. I agree that submodules shouldn't be always vendored, but I do think that should be the default, since "vendored" basically means "files from another project that are provided in this repo." |
@all-contributors add @Byron for code and userTesting |
I've put up a pull request to add @Byron! 🎉 |
The approach taken here it so provide multiple steps of the transformation which should be reviewed commit by commit. The goal is to start out sticking with the original structure and then transform it into the fastest possible solution. The latter will look quite different, thus the step-wise approach.
Proposed Review Workflow
I'd appreciate if you could make the changes as you see fit and push directly into this PR.
gh pr checkout 157
andgit push
should do the trick for that. That way we can spare each other slow round-trips that typically aren't necessary as you can more quickly do what's needed.Review Notes
test/javascript
branch now has to be adjusted to contain.gitattributes
as I changed the.gitattributes
sources. Feel free to just drop the commit though if it's too uncertain.Tasks
git2
togix
- 5.5x faster than baseline (yes, really 😁)git2
- the exposed API is really hard to get fast as well.git
-based ingestion - on a small repo, the speedup is just ~20x; on WebKit it now takes 1.9s on a hot cache, that's 60x faster; on Linux it's 40x faster, on Rust it's 55x faster (without submodules)rev
only - this makes runs onRust
orWebkit
about 100-150ms faster. Not massive, but not nothing to sneeze at either.Correctness Fixes
Affected Issues
git2
forgix
#154.gitattributes
from rev? #31/
? #30Performance Data
git2
togix
GUse index, single-threaded
Use index, multi-threaded
Rust-repo without submodules
Rust-repo - with submodules
Note the
gengo
doesn't consider submodules, so has far less work to do.Webkit profile
This shows that gathering an index for the repository takes a lot of time and it can't be parallelized. The cost of doing business.
Linux - 1% performance loss due to 10 attributes, instead of just 5