Vastly improve performance #157

Merged
merged 14 commits into from
Sep 5, 2023

Conversation

Byron
Contributor

@Byron Byron commented Sep 3, 2023

The approach taken here is to provide multiple steps of the transformation, which should be reviewed commit by commit. The goal is to start out sticking with the original structure and then transform it into the fastest possible solution. The latter will look quite different, hence the step-wise approach.

Proposed Review Workflow

I'd appreciate it if you could make changes as you see fit and push them directly into this PR. gh pr checkout 157 and git push should do the trick. That way we can spare each other slow round-trips that typically aren't necessary, as you can more quickly do what's needed.

Review Notes

  • Slashes are universally used now, let's see what happens with the tests on Windows.
  • test/javascript branch now has to be adjusted to contain .gitattributes as I changed the .gitattributes sources. Feel free to just drop the commit though if it's too uncertain.
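
As a rough illustration of the first note, a path can be brought into slash-only form with a plain string replacement. The helper name `to_slash_path` is hypothetical and is not taken from this PR; it is a minimal sketch that assumes the path is already valid UTF-8 and that backslashes only ever appear as separators.

```rust
/// Hypothetical helper (not from this PR): turn Windows-style separators
/// into forward slashes so the same path compares equal on every platform.
/// Assumption: a backslash is never a legitimate part of a file name here
/// (on Unix it can be, so real code would need to be more careful).
fn to_slash_path(path: &str) -> String {
    path.replace('\\', "/")
}
```

Paths that already use forward slashes pass through unchanged, which is what lets snapshot tests share expectations between platforms.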

Tasks

  • git2 to gix - 5.5x faster than baseline (yes, really 😁)
  • use index instead of manual tree traversal (enables parallelization, later submodule handling)
    • 5.7x faster than baseline, WebKit finishes in 15s now (down from 2min), an ~8x speedup (we are used to this now ;)). Pretty sure the extreme slowdown is due to the inefficient handling of attributes in git2 - the exposed API is really hard to get fast as well.
  • parallelize git-based ingestion - on a small repo, the speedup is just ~20x; on WebKit it now takes 1.9s on a hot cache, that's 60x faster; on Linux it's 40x faster, on Rust it's 55x faster (without submodules)
  • submodule support - all timings are the same, but for Rust it changes to only 8.5x faster as it now considers a lot more files.
  • Use attributes from rev only - this makes runs on Rust or WebKit about 100-150ms faster. Not massive, but nothing to sneeze at either.
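
The reason a flat index enables parallelization can be sketched with the standard library alone: once every file is an entry in one list, the list can be cut into chunks and each chunk handed to a worker. This is not the code from this PR. `classify` and `count_languages` are hypothetical names, and classifying by file extension is only a stand-in for the real per-file analysis.

```rust
use std::collections::HashMap;
use std::thread;

/// Hypothetical stand-in for per-file analysis: classify by extension only.
fn classify(path: &str) -> &'static str {
    match path.rsplit('.').next() {
        Some("rs") => "Rust",
        Some("js") => "JavaScript",
        _ => "Other",
    }
}

/// Count files per language by splitting a flat list of entries across
/// `threads` scoped worker threads and merging their partial results.
fn count_languages(paths: &[&str], threads: usize) -> HashMap<&'static str, usize> {
    let threads = threads.max(1);
    // Round up so that every entry lands in some chunk; never use a chunk size of 0.
    let chunk_size = ((paths.len() + threads - 1) / threads).max(1);
    let mut totals = HashMap::new();
    thread::scope(|scope| {
        // Spawn one worker per chunk; each builds its own local map.
        let handles: Vec<_> = paths
            .chunks(chunk_size)
            .map(|chunk| {
                scope.spawn(move || {
                    let mut local: HashMap<&'static str, usize> = HashMap::new();
                    for path in chunk {
                        *local.entry(classify(path)).or_insert(0) += 1;
                    }
                    local
                })
            })
            .collect();
        // Merge the partial maps on the calling thread.
        for handle in handles {
            for (language, count) in handle.join().expect("worker panicked") {
                *totals.entry(language).or_insert(0) += count;
            }
        }
    });
    totals
}
```

A recursive tree walk has no such flat list up front, which is one plausible reason the index-based step had to come before the parallel one.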

Correctness Fixes

  • don't assume all paths are UTF-8 encoded, but skip those that don't convert correctly (typically on Windows)
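
A minimal sketch of the "skip instead of fail" behaviour, using only the standard library. The function name `utf8_paths` is hypothetical and the real code in this PR may handle paths differently; the point is only that an invalid path is dropped rather than aborting the whole run.

```rust
/// Keep only the paths that are valid UTF-8, silently skipping the rest.
/// Hypothetical helper, not taken from this PR.
fn utf8_paths<'a>(raw_paths: &[&'a [u8]]) -> Vec<&'a str> {
    raw_paths
        .iter()
        .copied()
        // `from_utf8` returns `Err` for invalid byte sequences; `ok()` turns
        // that into `None`, which `filter_map` drops.
        .filter_map(|bytes| std::str::from_utf8(bytes).ok())
        .collect()
}
```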

Affected Issues

Performance Data
git2 to gix
gitoxide ( improvements) [$]
❯ hyperfine -N --warmup 1 /Users/byron/dev/github.com/spenserblack/gengo/target/release/gengo gengo
Benchmark 1: /Users/byron/dev/github.com/spenserblack/gengo/target/release/gengo
  Time (mean ± σ):      68.9 ms ±   0.6 ms    [User: 61.0 ms, System: 7.2 ms]
  Range (min … max):    67.7 ms …  71.4 ms    43 runs

Benchmark 2: gengo
  Time (mean ± σ):     380.8 ms ±   1.3 ms    [User: 146.6 ms, System: 232.4 ms]
  Range (min … max):   378.6 ms … 383.1 ms    10 runs

Summary
  /Users/byron/dev/github.com/spenserblack/gengo/target/release/gengo ran
    5.53 ± 0.05 times faster than gengo
Use index, single-threaded
gitoxide ( improvements) [$] took 7s
❯ hyperfine -N --warmup 1 /Users/byron/dev/github.com/spenserblack/gengo/target/release/gengo gengo
Benchmark 1: /Users/byron/dev/github.com/spenserblack/gengo/target/release/gengo
  Time (mean ± σ):      66.8 ms ±   0.9 ms    [User: 59.6 ms, System: 6.6 ms]
  Range (min … max):    65.5 ms …  69.3 ms    44 runs

Benchmark 2: gengo
  Time (mean ± σ):     381.4 ms ±   1.5 ms    [User: 147.2 ms, System: 232.3 ms]
  Range (min … max):   377.8 ms … 383.0 ms    10 runs

Summary
  /Users/byron/dev/github.com/spenserblack/gengo/target/release/gengo ran
    5.71 ± 0.08 times faster than gengo

Use index, multi-threaded

gitoxide ( improvements) [$] took 3s
❯ hyperfine -N --warmup 1 /Users/byron/dev/github.com/spenserblack/gengo/target/release/gengo gengo
Benchmark 1: /Users/byron/dev/github.com/spenserblack/gengo/target/release/gengo
  Time (mean ± σ):      19.2 ms ±   0.4 ms    [User: 74.0 ms, System: 23.2 ms]
  Range (min … max):    18.5 ms …  20.8 ms    144 runs

Benchmark 2: gengo
  Time (mean ± σ):     382.2 ms ±   2.6 ms    [User: 147.0 ms, System: 233.4 ms]
  Range (min … max):   378.5 ms … 385.8 ms    10 runs

Summary
  /Users/byron/dev/github.com/spenserblack/gengo/target/release/gengo ran
   19.92 ± 0.43 times faster than gengo
linux ( master) +369 -819 [!] took 8s
❯ hyperfine -N --warmup 1 /Users/byron/dev/github.com/spenserblack/gengo/target/release/gengo gengos
Benchmark 1: /Users/byron/dev/github.com/spenserblack/gengo/target/release/gengo
  Time (mean ± σ):     655.8 ms ±   5.3 ms    [User: 4941.1 ms, System: 482.2 ms]
  Range (min … max):   644.6 ms … 662.3 ms    10 runs

Benchmark 2: gengos
Error: Failed to run command 'gengos': No such file or directory (os error 2)

linux ( master) +369 -819 [!] took 8s
❯ hyperfine -N --warmup 1 /Users/byron/dev/github.com/spenserblack/gengo/target/release/gengo gengo
Benchmark 1: /Users/byron/dev/github.com/spenserblack/gengo/target/release/gengo
  Time (mean ± σ):     669.3 ms ±  15.9 ms    [User: 4918.0 ms, System: 485.3 ms]
  Range (min … max):   653.7 ms … 695.1 ms    10 runs

Benchmark 2: gengo
  Time (mean ± σ):     27.069 s ±  0.308 s    [User: 11.664 s, System: 15.397 s]
  Range (min … max):   26.679 s … 27.570 s    10 runs

Summary
  /Users/byron/dev/github.com/spenserblack/gengo/target/release/gengo ran
   40.44 ± 1.06 times faster than gengo

Rust-repo without submodules

rust ( master) via 🐍
❯ hyperfine -N --warmup 1 /Users/byron/dev/github.com/spenserblack/gengo/target/release/gengo gengo
Benchmark 1: /Users/byron/dev/github.com/spenserblack/gengo/target/release/gengo
  Time (mean ± σ):     144.1 ms ±   6.7 ms    [User: 920.6 ms, System: 137.7 ms]
  Range (min … max):   139.2 ms … 170.6 ms    21 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Benchmark 2: gengo
  Time (mean ± σ):      8.052 s ±  0.145 s    [User: 2.771 s, System: 5.225 s]
  Range (min … max):    7.849 s …  8.297 s    10 runs

Summary
  /Users/byron/dev/github.com/spenserblack/gengo/target/release/gengo ran
   55.86 ± 2.80 times faster than gengo

Rust-repo - with submodules

Note that the baseline gengo doesn't consider submodules, so it has far less work to do.

rust ( master) +11 -11 via 🐍 took 7s
❯ hyperfine -N --warmup 1 /Users/byron/dev/github.com/spenserblack/gengo/target/release/gengo gengo
Benchmark 1: /Users/byron/dev/github.com/spenserblack/gengo/target/release/gengo
  Time (mean ± σ):     932.6 ms ±   3.2 ms    [User: 6813.0 ms, System: 831.0 ms]
  Range (min … max):   929.2 ms … 939.4 ms    10 runs

Benchmark 2: gengo
  Time (mean ± σ):      7.967 s ±  0.100 s    [User: 2.749 s, System: 5.205 s]
  Range (min … max):    7.812 s …  8.100 s    10 runs

Summary
  /Users/byron/dev/github.com/spenserblack/gengo/target/release/gengo ran
    8.54 ± 0.11 times faster than gengo

Webkit profile

This shows that gathering an index for the repository takes a lot of time and it can't be parallelized. The cost of doing business.

Screenshot 2023-09-03 at 15 00 33

Linux - 1% performance loss due to 10 attributes, instead of just 5

❯ hyperfine -N --warmup 1 /Users/byron/dev/github.com/spenserblack/gengo/target/release/gengo gengo
Benchmark 1: /Users/byron/dev/github.com/spenserblack/gengo/target/release/gengo
  Time (mean ± σ):     651.3 ms ±   5.2 ms    [User: 4939.9 ms, System: 384.2 ms]
  Range (min … max):   643.2 ms … 658.2 ms    10 runs

Benchmark 2: gengo
  Time (mean ± σ):     644.8 ms ±   4.4 ms    [User: 4899.4 ms, System: 372.5 ms]
  Range (min … max):   637.5 ms … 650.2 ms    10 runs

Summary
  gengo ran
    1.01 ± 0.01 times faster than /Users/byron/dev/github.com/spenserblack/gengo/target/release/gengo
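
For readers who want to reproduce the "times faster" figures from the hyperfine summaries above, the following sketch computes the ratio of two mean runtimes together with its uncertainty. It assumes the summary uses ordinary first-order propagation of the two standard deviations; the function name `speedup` is hypothetical.

```rust
/// Ratio of two mean runtimes and its uncertainty.
/// Assumption: the relative errors of both measurements are combined in
/// quadrature (standard first-order error propagation).
fn speedup(slow_mean: f64, slow_sd: f64, fast_mean: f64, fast_sd: f64) -> (f64, f64) {
    let ratio = slow_mean / fast_mean;
    let relative_error =
        ((slow_sd / slow_mean).powi(2) + (fast_sd / fast_mean).powi(2)).sqrt();
    (ratio, ratio * relative_error)
}
```

Feeding in the first benchmark (380.8 ms ± 1.3 ms for the baseline, 68.9 ms ± 0.6 ms for the new build) gives roughly 5.53 ± 0.05, in line with the reported summary.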

@Byron Byron marked this pull request as ready for review September 3, 2023 14:11
@Byron Byron changed the title git2-to-gix git2-to-gix - 60x faster on WebKit, 40x faster on Linux Sep 3, 2023
@spenserblack
Owner

This is awesome! I'll review more when I get a chance, but just some quick responses:

I recommend either keeping support for both gengo- and linguist- attributes, but would prefer to only keep the linguist-* set for 1% more

I'd like to not support linguist- attributes. While this crate takes a lot of inspiration from linguist, I want the freedom to diverge from how linguist does things. For example, using different names for languages. And right now the -detectable behavior is (intentionally) different.


Note to self: submodules should be detected as vendored by default.

@Byron
Contributor Author

Byron commented Sep 3, 2023

Great, take your time!

I will fix CI and undo the change to attributes. Also I plan to make a new release of gix with a QoL improvement, but besides that what you see should be pretty much 'it'.


When looking at linguist and remembering that it took 5min on a repo, I couldn't help but wonder if they run this at GitHub scale. If so, how much time must be wasted 🥺.

@Byron
Contributor Author

Byron commented Sep 3, 2023

CI should be fixed now, and the gix version used here will stay the same after all. Thus it should definitely be ready now :).

@spenserblack
Owner

When looking at linguist and remembering that it took 5min on a repo, I couldn't help but wonder if they run this at GitHub scale. If so, how much time must be wasted 🥺.

Performance is probably much better on GitHub's hardware than my own limited VM 😆 but yeah, it does raise that question. And Linguist has had some issues -- recently, one of its heuristics was a ReDOS that stopped some files from rendering or getting highlighted. Ironically, that ReDOS was introduced in an attempt to get heuristics.yml to use more generic regexes for a Rust port of linguist 😆

There's also the question of what exactly causes the time spent for the Linux repo. Is it simply the file count, or are there file(s) that severely hit their Bayesian classifier, or something else? Usage as a library might be more performant than as a binary, too: for example, if the classifier is set up each time you run the binary, but only once for the library.


I guess it's time to admit my hopeful thinking: I know that GitHub switched their code search to a Rust implementation for more performance. Could there perhaps be a Rust project for GitHub's language detection? 🤔 😆

@spenserblack
Owner

BTW don't worry about the failing tests for now. I'll update the .gitattributes in test/javascript and update the snapshots as needed.

@codecov

codecov bot commented Sep 3, 2023

Codecov Report

Merging #157 (d92c7ba) into main (38b98a3) will decrease coverage by 1.33%.
The diff coverage is 80.14%.

@@            Coverage Diff             @@
##             main     #157      +/-   ##
==========================================
- Coverage   83.65%   82.33%   -1.33%     
==========================================
  Files          14       14              
  Lines         514      566      +52     
==========================================
+ Hits          430      466      +36     
- Misses         84      100      +16     
Flag Coverage Δ
82.33% <80.14%> (-1.33%) ⬇️
macOS-latest 82.26% <79.85%> (-1.39%) ⬇️
ubuntu-latest 79.54% <77.11%> (-0.94%) ⬇️
windows-latest 78.36% <79.10%> (+0.15%) ⬆️

Files Changed Coverage Δ
gengo/src/builder.rs 88.46% <50.00%> (+3.84%) ⬆️
gengo/src/analysis/mod.rs 79.31% <80.00%> (-8.19%) ⬇️
gengo/src/lib.rs 81.15% <81.65%> (-5.71%) ⬇️
gengo-bin/src/cli.rs 87.50% <100.00%> (ø)

@Byron
Contributor Author

Byron commented Sep 3, 2023

I guess it's time to admit my hopeful thinking: I know that GitHub switched their code search to a Rust implementation for more performance. Could there perhaps be a Rust project for GitHub's language detection? 🤔 😆

I actually had a feeling :D! But then you said there are differences and incompatibilities which makes that less likely to happen, so they would probably rather do their own RiR. Time will tell what's going to happen though :).

Do you have an idea why people would add linguist-* attributes in the first place? Does that affect GitHub language statistics?

@spenserblack
Owner

spenserblack commented Sep 3, 2023

Do you have an idea why people would add linguist-* attributes in the first place?

One of the most common is probably linguist-language. Linguist does a good job of detecting languages, but it's not perfect. Personally, I've used .vscode/*.json linguist-language=JSON-with-comments a lot, since linguist often detects "JSON with comments" as "JSON without comments." This affects not just language statistics, but syntax highlighting.

People use linguist-documentation, linguist-vendored, and linguist-generated almost interchangeably as a way to tell linguist to ignore certain files. linguist-generated is supposed to also exclude files from PR diffs by default (see how Cargo.lock diffs aren't rendered in PRs), which some like and some don't.

"prose" and "data" are excluded by default from language statistics, so sometimes you want to use linguist-detectable to include them. As an example, Markdown is considered "prose," but you might have a website that's generated from Markdown, in which case you would want *.md linguist-detectable.

One of the intentional divergences from linguist is how these attributes behave: the linguist maintainers will admit that these attributes can be a bit confusing. So Cargo.lock linguist-detectable doesn't work AFAIK because Cargo.lock is considered generated, and not counted even if you use -detectable. For less confusion, gengo-detectable means "detectable no matter what."
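
The "detectable no matter what" rule can be sketched as an override that is checked before any other exclusion. Everything below is hypothetical: `Category`, `FileInfo` and `counts_toward_stats` are illustrative names rather than gengo's actual types, and the default (only programming and markup count) is an assumption borrowed from the linguist behaviour described above.

```rust
/// Hypothetical file categories, loosely following the ones named above.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Category {
    Programming,
    Markup,
    Prose,
    Data,
}

/// Hypothetical per-file facts; not gengo's actual data model.
struct FileInfo {
    category: Category,
    generated: bool,
    vendored: bool,
    documentation: bool,
    /// `Some(true)` / `Some(false)` when a detectable attribute was
    /// explicitly set or unset for this file, `None` when unspecified.
    detectable_override: Option<bool>,
}

/// Decide whether a file contributes to language statistics.
fn counts_toward_stats(file: &FileInfo) -> bool {
    // An explicit attribute wins over every other rule.
    if let Some(forced) = file.detectable_override {
        return forced;
    }
    // Assumed default: prose and data are left out, as are files that are
    // generated, vendored or documentation.
    let excluded = file.generated || file.vendored || file.documentation;
    let counted = matches!(file.category, Category::Programming | Category::Markup);
    counted && !excluded
}
```

Under this sketch a generated data file such as Cargo.lock is skipped by default but counted once the override is set, which is the behaviour the comment above contrasts with linguist.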

spenserblack added a commit that referenced this pull request Sep 4, 2023
Owner

@spenserblack spenserblack left a comment

Thanks again!

gengo-bin/Cargo.toml (outdated, resolved)
gengo/src/analysis/mod.rs (outdated, resolved)
gengo/src/analysis/mod.rs (resolved)
gengo/src/builder.rs (outdated, resolved)
gengo/src/lib.rs (outdated, resolved)
gengo/src/lib.rs (resolved)
gengo/src/lib.rs (resolved)
@Byron
Contributor Author

Byron commented Sep 4, 2023

Thanks for the review.
Could you do me a small favor and apply the style changes (and any other changes) yourself? It's not worth your time to write about them, as they are so easy to fix, at least while in front of a computer. Besides that, I am happy to answer any question that might come up, or address bigger issues that are more than a quick-fix. Thanks again.

@spenserblack
Owner

Could you do me a small favor and apply the style changes (and any other changes) yourself?

Yup! I wasn't that clear, but some of these are to-dos for myself. I'd appreciate it if you quickly skimmed over my style changes/refactoring later, though, just to confirm the changes are purely code style, not affecting behavior.

The `max-performance` and `max-performance-safe` gix features have been
re-exported so that consumers can enable/disable them without importing
a matching version of `gix`.
@spenserblack
Owner

🤢 Per-OS snapshot tests certainly have their downsides 😆 I should've learned from repofetch.


@Byron as the gix expert, do you have any recommendations on detecting files from submodules as vendored? My lazy self was thinking of just adding an is_submodule boolean to the stack 😆

It would probably be cleanest to implement this in the Vendored struct, but that would probably complicate constructing the struct.

@spenserblack spenserblack changed the title git2-to-gix - 60x faster on WebKit, 40x faster on Linux Vastly improve performance Sep 4, 2023
@spenserblack
Owner

🤔 Now that I think about it, while traversing submodules is really cool, I wonder if it should be dropped. I'd want to mark submodules as vendored, but that is very specific to git, and could block, or at least make much more difficult to implement, other file sources. And submodules are technically other repos, and usually not code that belongs to the main project...

@Byron
Contributor Author

Byron commented Sep 4, 2023

@Byron as the gix expert, do you have any recommendations on detecting files from submodules as vendored? My lazy self was thinking of just adding an is_submodule boolean to the stack 😆

I thought the same - with that flag it should be possible to implement any logic. And yes, it's false just for the first repo :D.

🤔 Now that I think about it, while traversing submodules is really cool, I wonder if it should be dropped. I'd want to mark submodules as vendored, but that is very specific to git, and could block, or at least make much more difficult to implement, other file sources. And submodules are technically other repos, and usually not code that belongs to the main project...

Not traversing submodules is faster, of course, and maybe for that reason alone it could be a configurable flag. However, I don't think they should be skipped, as I, at least, am interested in how much of the project is vendored. Maybe that's entirely the wrong thing to do though, and it's better to look at how linguist does it.

@Byron
Contributor Author

Byron commented Sep 4, 2023

I thought about this again and definitely think that submodules should not be ignored. The question, however, is what to do with them by default or… how to allow users to affect how they are classified. There is no problem in assigning attributes to submodules, and I think this could be used to control whether they should be vendored or not. This reminds me, maybe one should check for is_unset() to disable the vendor-by-default behaviour?

@spenserblack
Owner

How would .gitattributes get resolved for a submodule's files, by the way? I've never experimented with it, but I would assume that they wouldn't be able to "affect" a submodule's files. If you're modifying submodule /A, then the following .gitattributes in the root repo wouldn't really do anything, right?

# /.gitattributes
A/* eol=lf

Anyway, for now I'm really excited about this PR, so I'd be happy to merge this and continue the discussion on how to handle submodules later. So perhaps just an is_submodule flag on the stack for now?

@spenserblack
Owner

OK, thinking about it a bit more, it might be nice for Vendored to have an add_vendored_dir method, or something similar. Any submodule we encounter could have its path registered to that. The code would be a bit cleaner and more testable that way IMO.
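
A minimal sketch of what such a method could look like. The `Vendored` struct below is a hypothetical stand-in written for illustration, not the struct that exists in gengo; only the method name `add_vendored_dir` comes from the comment above, while `dirs` and `is_vendored` are assumptions.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical stand-in for the `Vendored` struct discussed above.
#[derive(Default)]
struct Vendored {
    /// Directories registered at runtime, e.g. submodule roots.
    dirs: Vec<PathBuf>,
}

impl Vendored {
    /// Register a directory whose contents should all count as vendored.
    fn add_vendored_dir(&mut self, dir: impl Into<PathBuf>) {
        self.dirs.push(dir.into());
    }

    /// A path is vendored if it lives under any registered directory.
    /// `Path::starts_with` compares whole components, so `deps/zlib2`
    /// does not match a registered `deps/zlib`.
    fn is_vendored(&self, path: &Path) -> bool {
        self.dirs.iter().any(|dir| path.starts_with(dir))
    }
}
```

Keeping the registry free of any git types is what would make it testable in isolation, and it leaves room for file sources other than git.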

@Byron
Contributor Author

Byron commented Sep 5, 2023

How would .gitattributes get resolved for a submodule's files, by the way? I've never experimented with it, but I would assume that they wouldn't be able to "affect" a submodule's files. If you're modifying submodule /A, then the following .gitattributes in the root repo wouldn't really do anything, right?

Actually it would - it's perfectly viable to use complete paths of submodule files (relative to the superproject root) for an attribute query in the superproject. But that's not what I meant - I merely meant checking the path of the submodule/worktree against the superproject to get attributes for configuration. With that, whatever default you choose could be overridden.

Anyway, for now I'm really excited about this PR, so I'd be happy to merge this and continue the discussion on how to handle submodules later. So perhaps just an is_submodule flag on the stack for now?

That would be enough to declare all submodule files vendored, I think.

OK, thinking about it a bit more, it might be nice for Vendored to have an add_vendored_dir method, or something similar. Any submodule we encounter could have its path registered to that. The code would be a bit cleaner and more testable that way IMO.

Maybe that doesn't matter (yet) for merging the PR like you mentioned. However, I don't think submodules should always be vendored; there should be an escape hatch that just treats them like ordinary code that is part of the tree (like it's done now). Tying submodules too tightly into vendor-handling then seems counter-intuitive. But again, in the end it doesn't matter as long as the code does what it should and you define what that is.

@spenserblack
Owner

Going to merge as-is, we can handle submodules in a separate PR. I'll see if I can get something set up.

I agree that submodules shouldn't be always vendored, but I do think that should be the default, since "vendored" basically means "files from another project that are provided in this repo."
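
The combination discussed in this thread (vendored by default for submodules, with an attribute as the escape hatch) can be sketched as a small decision function. All names are hypothetical: `AttrState` is a local stand-in rather than a gix type, and the function assumes the vendored attribute has already been looked up for the path in question.

```rust
/// Hypothetical tri-state result of looking up a "vendored" attribute.
#[derive(Clone, Copy)]
enum AttrState {
    /// The attribute is explicitly set for the path.
    Set,
    /// The attribute is explicitly unset for the path.
    Unset,
    /// No attribute rule matched the path.
    Unspecified,
}

/// Decide whether a file is vendored. An explicit attribute always wins;
/// otherwise submodule files fall back to "vendored" by default.
fn treat_as_vendored(is_submodule: bool, matches_vendor_pattern: bool, attr: AttrState) -> bool {
    match attr {
        AttrState::Set => true,
        AttrState::Unset => false,
        AttrState::Unspecified => is_submodule || matches_vendor_pattern,
    }
}
```

With this shape, the `is_submodule` flag only supplies a default, so a user could still opt a submodule back in by explicitly unsetting the attribute for its path.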

@spenserblack spenserblack merged commit 85e0605 into spenserblack:main Sep 5, 2023
9 of 11 checks passed
@Byron Byron deleted the git2-to-gix branch September 5, 2023 12:22
@spenserblack
Owner

@all-contributors add @Byron for code and userTesting

@allcontributors
Contributor

@spenserblack

I've put up a pull request to add @Byron! 🎉

@spenserblack spenserblack mentioned this pull request Sep 5, 2023
spenserblack added a commit that referenced this pull request Sep 5, 2023
Thought this was done in #157.
Development

Successfully merging this pull request may close these issues.

  • Parallelize
  • Drop git2 for gix
  • Use .gitattributes from rev?
  • Normalize paths to /?