Vastly improve performance #157

Byron · 2023-09-03T06:37:01Z

The approach taken here is to provide multiple steps of the transformation, which should be reviewed commit by commit. The goal is to start out sticking with the original structure and then transform it into the fastest possible solution. The latter will look quite different, hence the step-wise approach.

Proposed Review Workflow

I'd appreciate it if you could make changes as you see fit and push them directly into this PR. `gh pr checkout 157` and `git push` should do the trick for that. That way we can spare each other slow round-trips, which typically aren't necessary since you can do what's needed more quickly yourself.

Review Notes

Forward slashes are used universally now; let's see what happens with the tests on Windows.
The test/javascript branch now has to be adjusted to contain .gitattributes, as I changed the .gitattributes sources. Feel free to just drop the commit if that's too uncertain.

Tasks

git2 to gix - 5.5x faster than baseline (yes, really 😁)
use index instead of manual tree traversal (enables parallelization and, later, submodule handling) - 5.7x faster than baseline; WebKit finishes in 15s now (down from 2min), an ~8x speedup (we are used to this now ;)). Pretty sure the extreme slowdown is due to the inefficient handling of attributes in git2 - the exposed API is really hard to make fast as well.
parallelize git-based ingestion - on a small repo, the speedup is just ~20x; on WebKit it now takes 1.9s on a hot cache, that's 60x faster; on Linux it's 40x faster, on Rust it's 55x faster (without submodules)
submodule support - all timings are the same, but for Rust it changes to only 8.5x faster as it now considers a lot more files.
Use attributes from rev only - this makes runs on Rust or WebKit about 100-150ms faster. Not massive, but nothing to sneeze at either.
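
To illustrate why a flat index lends itself to parallelization, here is a minimal sketch using only the standard library. It is not the code in this PR and does not use the gix API: `detect_language` and `count_languages` are hypothetical names, and the extension lookup merely stands in for the real per-file analysis. The point is that a flat list of entries can be split into chunks, handled by independent workers, and merged at the end.

```rust
use std::collections::HashMap;
use std::thread;

/// Hypothetical stand-in for per-file analysis: guess a language from the extension.
fn detect_language(path: &str) -> &'static str {
    match path.rsplit_once('.').map(|(_, ext)| ext) {
        Some("rs") => "Rust",
        Some("js") => "JavaScript",
        Some("md") => "Markdown",
        _ => "Unknown",
    }
}

/// Count files per language by splitting the flat entry list across `threads` workers.
fn count_languages(entries: &[&str], threads: usize) -> HashMap<&'static str, usize> {
    let workers = threads.max(1);
    // Round up so every entry lands in a chunk; `chunks` panics on a size of 0.
    let chunk_size = ((entries.len() + workers - 1) / workers).max(1);

    thread::scope(|scope| {
        // Each worker builds its own map, so no locking is needed while counting.
        let handles: Vec<_> = entries
            .chunks(chunk_size)
            .map(|chunk| {
                scope.spawn(move || {
                    let mut local: HashMap<&'static str, usize> = HashMap::new();
                    for path in chunk {
                        *local.entry(detect_language(path)).or_insert(0) += 1;
                    }
                    local
                })
            })
            .collect();

        // Merge the per-worker maps into one result.
        let mut total = HashMap::new();
        for handle in handles {
            for (lang, n) in handle.join().expect("worker panicked") {
                *total.entry(lang).or_insert(0) += n;
            }
        }
        total
    })
}
```

A recursive tree walk has to discover entries as it descends, whereas a flat list is known up front, which is what makes the even split above possible.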

Correctness Fixes

don't assume all paths are UTF-8 encoded; skip those that can't be converted (typically on Windows)
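
A minimal sketch of the skip-on-failure idea, assuming paths arrive as raw bytes. `utf8_paths` is a hypothetical helper for illustration only and is not the function used in this PR.

```rust
/// Keep only the paths that are valid UTF-8 and silently skip the rest.
fn utf8_paths<'a>(raw_paths: &[&'a [u8]]) -> Vec<&'a str> {
    raw_paths
        .iter()
        .copied()
        // `from_utf8` validates without copying; invalid paths become `None` and are dropped.
        .filter_map(|bytes| std::str::from_utf8(bytes).ok())
        .collect()
}
```

Skipping rather than failing means one oddly encoded file name cannot abort the analysis of an entire repository.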

Affected Issues

Fixes Drop git2 for gix #154
Fixes Parallelize #155
Probably fixes Use .gitattributes from rev? #31
Fixes Normalize paths to /? #30

Performance Data

`git2` to `gix`

gitoxide ( improvements) [$]
❯ hyperfine -N --warmup 1 /Users/byron/dev/github.com/spenserblack/gengo/target/release/gengo gengo
Benchmark 1: /Users/byron/dev/github.com/spenserblack/gengo/target/release/gengo
  Time (mean ± σ):      68.9 ms ±   0.6 ms    [User: 61.0 ms, System: 7.2 ms]
  Range (min … max):    67.7 ms …  71.4 ms    43 runs

Benchmark 2: gengo
  Time (mean ± σ):     380.8 ms ±   1.3 ms    [User: 146.6 ms, System: 232.4 ms]
  Range (min … max):   378.6 ms … 383.1 ms    10 runs

Summary
  /Users/byron/dev/github.com/spenserblack/gengo/target/release/gengo ran
    5.53 ± 0.05 times faster than gengo
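
For readers who want to check the summary lines, the figures can be reproduced from the two means and standard deviations. The sketch below assumes the uncertainty is obtained by first-order error propagation for a ratio of two independent measurements; `speedup` is a hypothetical helper name.

```rust
/// Speedup of the fast command over the slow one, with propagated uncertainty.
/// Returns `(ratio, uncertainty)`.
fn speedup(fast_mean: f64, fast_sd: f64, slow_mean: f64, slow_sd: f64) -> (f64, f64) {
    let ratio = slow_mean / fast_mean;
    // Relative errors of independent measurements add in quadrature.
    let relative_error =
        ((fast_sd / fast_mean).powi(2) + (slow_sd / slow_mean).powi(2)).sqrt();
    (ratio, ratio * relative_error)
}
```

With the numbers above, `speedup(68.9, 0.6, 380.8, 1.3)` gives roughly 5.53 and 0.05, matching the reported summary.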

Use index, single-threaded

gitoxide ( improvements) [$] took 7s
❯ hyperfine -N --warmup 1 /Users/byron/dev/github.com/spenserblack/gengo/target/release/gengo gengo
Benchmark 1: /Users/byron/dev/github.com/spenserblack/gengo/target/release/gengo
  Time (mean ± σ):      66.8 ms ±   0.9 ms    [User: 59.6 ms, System: 6.6 ms]
  Range (min … max):    65.5 ms …  69.3 ms    44 runs

Benchmark 2: gengo
  Time (mean ± σ):     381.4 ms ±   1.5 ms    [User: 147.2 ms, System: 232.3 ms]
  Range (min … max):   377.8 ms … 383.0 ms    10 runs

Summary
  /Users/byron/dev/github.com/spenserblack/gengo/target/release/gengo ran
    5.71 ± 0.08 times faster than gengo

Use index, multi-threaded

gitoxide ( improvements) [$] took 3s
❯ hyperfine -N --warmup 1 /Users/byron/dev/github.com/spenserblack/gengo/target/release/gengo gengo
Benchmark 1: /Users/byron/dev/github.com/spenserblack/gengo/target/release/gengo
  Time (mean ± σ):      19.2 ms ±   0.4 ms    [User: 74.0 ms, System: 23.2 ms]
  Range (min … max):    18.5 ms …  20.8 ms    144 runs

Benchmark 2: gengo
  Time (mean ± σ):     382.2 ms ±   2.6 ms    [User: 147.0 ms, System: 233.4 ms]
  Range (min … max):   378.5 ms … 385.8 ms    10 runs

Summary
  /Users/byron/dev/github.com/spenserblack/gengo/target/release/gengo ran
   19.92 ± 0.43 times faster than gengo

linux ( master) +369 -819 [!] took 8s
❯ hyperfine -N --warmup 1 /Users/byron/dev/github.com/spenserblack/gengo/target/release/gengo gengos
Benchmark 1: /Users/byron/dev/github.com/spenserblack/gengo/target/release/gengo
  Time (mean ± σ):     655.8 ms ±   5.3 ms    [User: 4941.1 ms, System: 482.2 ms]
  Range (min … max):   644.6 ms … 662.3 ms    10 runs

Benchmark 2: gengos
Error: Failed to run command 'gengos': No such file or directory (os error 2)

linux ( master) +369 -819 [!] took 8s
❯ hyperfine -N --warmup 1 /Users/byron/dev/github.com/spenserblack/gengo/target/release/gengo gengo
Benchmark 1: /Users/byron/dev/github.com/spenserblack/gengo/target/release/gengo
  Time (mean ± σ):     669.3 ms ±  15.9 ms    [User: 4918.0 ms, System: 485.3 ms]
  Range (min … max):   653.7 ms … 695.1 ms    10 runs

Benchmark 2: gengo
  Time (mean ± σ):     27.069 s ±  0.308 s    [User: 11.664 s, System: 15.397 s]
  Range (min … max):   26.679 s … 27.570 s    10 runs

Summary
  /Users/byron/dev/github.com/spenserblack/gengo/target/release/gengo ran
   40.44 ± 1.06 times faster than gengo

Rust-repo without submodules

rust ( master) via 🐍
❯ hyperfine -N --warmup 1 /Users/byron/dev/github.com/spenserblack/gengo/target/release/gengo gengo
Benchmark 1: /Users/byron/dev/github.com/spenserblack/gengo/target/release/gengo
  Time (mean ± σ):     144.1 ms ±   6.7 ms    [User: 920.6 ms, System: 137.7 ms]
  Range (min … max):   139.2 ms … 170.6 ms    21 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Benchmark 2: gengo
  Time (mean ± σ):      8.052 s ±  0.145 s    [User: 2.771 s, System: 5.225 s]
  Range (min … max):    7.849 s …  8.297 s    10 runs

Summary
  /Users/byron/dev/github.com/spenserblack/gengo/target/release/gengo ran
   55.86 ± 2.80 times faster than gengo

Rust-repo - with submodules

Note that the released gengo doesn't consider submodules, so it has far less work to do.

rust ( master) +11 -11 via 🐍 took 7s
❯ hyperfine -N --warmup 1 /Users/byron/dev/github.com/spenserblack/gengo/target/release/gengo gengo
Benchmark 1: /Users/byron/dev/github.com/spenserblack/gengo/target/release/gengo
  Time (mean ± σ):     932.6 ms ±   3.2 ms    [User: 6813.0 ms, System: 831.0 ms]
  Range (min … max):   929.2 ms … 939.4 ms    10 runs

Benchmark 2: gengo
  Time (mean ± σ):      7.967 s ±  0.100 s    [User: 2.749 s, System: 5.205 s]
  Range (min … max):    7.812 s …  8.100 s    10 runs

Summary
  /Users/byron/dev/github.com/spenserblack/gengo/target/release/gengo ran
    8.54 ± 0.11 times faster than gengo

Webkit profile

This shows that gathering an index for the repository takes a lot of time and it can't be parallelized. The cost of doing business.

Linux - 1% performance loss due to 10 attributes, instead of just 5

❯ hyperfine -N --warmup 1 /Users/byron/dev/github.com/spenserblack/gengo/target/release/gengo gengo
Benchmark 1: /Users/byron/dev/github.com/spenserblack/gengo/target/release/gengo
  Time (mean ± σ):     651.3 ms ±   5.2 ms    [User: 4939.9 ms, System: 384.2 ms]
  Range (min … max):   643.2 ms … 658.2 ms    10 runs

Benchmark 2: gengo
  Time (mean ± σ):     644.8 ms ±   4.4 ms    [User: 4899.4 ms, System: 372.5 ms]
  Range (min … max):   637.5 ms … 650.2 ms    10 runs

Summary
  gengo ran
    1.01 ± 0.01 times faster than /Users/byron/dev/github.com/spenserblack/gengo/target/release/gengo

spenserblack · 2023-09-03T14:28:12Z

This is awesome! I'll review more when I get a chance, but just some quick responses:

I recommend either keeping support for both gengo- and linguist- attributes, but would prefer to only keep the linguist-* set for 1% more

I'd like to not support linguist- attributes. While this crate takes a lot of inspiration from linguist, I want the freedom to diverge from how linguist does things. For example, using different names for languages. And right now the -detectable behavior is (intentionally) different.

Note to self: submodules should be detected as vendored by default.

Byron · 2023-09-03T14:32:44Z

Great, take your time!

I will fix CI and undo the change to attributes. Also I plan to make a new release of gix with a QoL improvement, but besides that what you see should be pretty much 'it'.

When looking at linguist and remembering that it took 5min on a repo, I couldn't help but wonder if they run this at GitHub scale. If so, how much time must be wasted 🥺.

Byron · 2023-09-03T15:35:37Z

CI should be fixed now, and the gix version used here will stay the same after all. Thus it should definitely be ready now :).

spenserblack · 2023-09-03T17:48:12Z

When looking at linguist and remembering that it took 5min on a repo, I couldn't help but wonder if they run this at GitHub scale. If so, how much time must be wasted 🥺.

Performance is probably much better on GitHub's hardware than my own limited VM 😆 but yeah, it does raise that question. And Linguist has had some issues -- recently, one of its heuristics was a ReDOS that stopped some files from rendering or getting highlighted. Ironically, that ReDOS was introduced in an attempt to get heuristics.yml to use more generic regexes for a Rust port of linguist 😆

There's also the question of what exactly causes the time spent on the Linux repo. Is it simply the file count, or are there files that hit their Bayesian classifier especially hard, or something else? Usage as a library might be more performant than as a binary, too, for example if the classifier is set up every time you run the binary but only once when used as a library.

I guess it's time to admit my hopeful thinking: I know that GitHub switched their code search to a Rust implementation for more performance. Could there perhaps be a Rust project for GitHub's language detection? 🤔 😆

spenserblack · 2023-09-03T17:49:03Z

BTW don't worry about the failing tests for now. I'll update the .gitattributes in test/javascript and update the snapshots as needed.

codecov · 2023-09-03T17:49:36Z

Codecov Report

Merging #157 (d92c7ba) into main (38b98a3) will decrease coverage by 1.33%.
The diff coverage is 80.14%.

@@            Coverage Diff             @@
##             main     #157      +/-   ##
==========================================
- Coverage   83.65%   82.33%   -1.33%     
==========================================
  Files          14       14              
  Lines         514      566      +52     
==========================================
+ Hits          430      466      +36     
- Misses         84      100      +16

Flag	Coverage Δ
	`82.33% <80.14%> (-1.33%)`	⬇️
macOS-latest	`82.26% <79.85%> (-1.39%)`	⬇️
ubuntu-latest	`79.54% <77.11%> (-0.94%)`	⬇️
windows-latest	`78.36% <79.10%> (+0.15%)`	⬆️

Files Changed	Coverage Δ
gengo/src/builder.rs	`88.46% <50.00%> (+3.84%)`	⬆️
gengo/src/analysis/mod.rs	`79.31% <80.00%> (-8.19%)`	⬇️
gengo/src/lib.rs	`81.15% <81.65%> (-5.71%)`	⬇️
gengo-bin/src/cli.rs	`87.50% <100.00%> (ø)`

Byron · 2023-09-03T18:18:12Z

I guess it's time to admit my hopeful thinking: I know that GitHub switched their code search to a Rust implementation for more performance. Could there perhaps be a Rust project for GitHub's language detection? 🤔 😆

I actually had a feeling :D! But then you said there are differences and incompatibilities which makes that less likely to happen, so they would probably rather do their own RiR. Time will tell what's going to happen though :).

Do you have an idea why people would add linguist-* attributes in the first place? Does that affect GitHub language statistics?

spenserblack · 2023-09-03T19:27:16Z

Do you have an idea why people would add linguist-* attributes in the first place?

One of the most common is probably linguist-language. Linguist does a good job of detecting languages, but it's not perfect. Personally, I've used .vscode/*.json linguist-language=JSON-with-comments a lot, since linguist often detects "JSON with comments" as "JSON without comments." This affects not just language statistics, but syntax highlighting.

People use linguist-documentation, linguist-vendored, and linguist-generated almost interchangeably as a way to tell linguist to ignore certain files. linguist-generated is supposed to also exclude files from PR diffs by default (see how Cargo.lock diffs aren't rendered in PRs), which some like and some don't.

"prose" and "data" are excluded by default from language statistics, so sometimes you want to use linguist-detectable to include them. As an example, Markdown is considered "prose," but you might have a website that's generated from Markdown, in which case you would want *.md linguist-detectable.

One of the intentional divergences from linguist is how these attributes behave: the linguist maintainers will admit that these attributes can be a bit confusing. So Cargo.lock linguist-detectable doesn't work AFAIK because Cargo.lock is considered generated, and not counted even if you use -detectable. For less confusion, gengo-detectable means "detectable no matter what."
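
The difference can be sketched as two small decision functions. This is a simplified model of the behavior as described in this comment, not the actual logic of either tool: `FileFlags` and both function names are hypothetical, and the default branches are assumptions.

```rust
/// Hypothetical, simplified per-file flags.
#[derive(Clone, Copy, Default)]
struct FileFlags {
    /// Explicit `detectable` attribute: `Some(true)` if set, `Some(false)` if unset,
    /// `None` if the attribute is not specified.
    detectable_override: Option<bool>,
    is_prose_or_data: bool,
    is_generated: bool,
}

/// gengo as described above: an explicit attribute wins "no matter what".
fn counts_gengo_style(f: FileFlags) -> bool {
    match f.detectable_override {
        Some(explicit) => explicit,
        // Assumed default: prose, data, and generated files are not counted.
        None => !(f.is_prose_or_data || f.is_generated),
    }
}

/// linguist as described above (AFAIK): generated files are excluded first,
/// so `-detectable` cannot bring them back.
fn counts_linguist_style(f: FileFlags) -> bool {
    if f.is_generated {
        return false;
    }
    f.detectable_override.unwrap_or(!f.is_prose_or_data)
}
```

Under this model, a generated Cargo.lock marked detectable is counted in the first function but not in the second.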

spenserblack

Thanks again!

gengo-bin/Cargo.toml

gengo/src/analysis/mod.rs

gengo/src/builder.rs

gengo/src/lib.rs

Byron · 2023-09-04T16:57:00Z

Thanks for the review.
Could you do me a small favor and apply the style changes (and any other changes) yourself? It's not worth your time to write about them, as they are so easy to fix while in front of a computer. Besides that, I am happy to answer any questions that might come up, or to address bigger issues that are more than a quick fix. Thanks again.

spenserblack · 2023-09-04T17:02:03Z

Could you do me a small favor and apply the style changes (and any other changes) yourself?

Yup! I wasn't that clear, but some of these are to-dos for myself. I'd appreciate it if you quickly skimmed over my style changes/refactoring later, though, just to confirm the changes are purely code style and don't affect behavior.

spenserblack · 2023-09-04T17:32:00Z

🤢 Per-OS snapshot tests certainly have their downsides 😆 I should've learned from repofetch.

@Byron as the gix expert, do you have any recommendations on detecting files from submodules as vendored? My lazy self was thinking of just adding an is_submodule boolean to the stack 😆

It would probably be cleanest to implement this in the Vendored struct, but that would probably complicate constructing the struct.

spenserblack · 2023-09-04T17:53:56Z

🤔 Now that I think about it, while traversing submodules is really cool, I wonder if it should be dropped. I'd want to mark submodules as vendored, but that is very specific to git, and could block, or at least make much more difficult to implement, other file sources. And submodules are technically other repos, and usually not code that belongs to the main project...

Byron · 2023-09-04T17:59:47Z

@Byron as the gix expert, do you have any recommendations on detecting files from submodules as vendored? My lazy self was thinking of just adding an is_submodule boolean to the stack 😆

I thought the same - with that flag it should be possible to implement any logic. And yes, it's false just for the first repo :D.

🤔 Now that I think about it, while traversing submodules is really cool, I wonder if it should be dropped. I'd want to mark submodules as vendored, but that is very specific to git, and could block, or at least make much more difficult to implement, other file sources. And submodules are technically other repos, and usually not code that belongs to the main project...

Not traversing submodules is faster, of course, and maybe for that reason alone it could be a configurable flag. However, I don't think it should be skipped, as I, at least, am interested in how much of the project is vendored. Maybe that's entirely the wrong thing to do though, and it's better to look at how linguist does it.

Byron · 2023-09-04T19:58:59Z

I thought about this again and definitely think that submodules should not be ignored. However, the question is what to do with them by default, or… how to allow users to affect how they are treated. There is no problem in assigning attributes to submodules, and I think this could be used to control whether they should be vendored or not. This reminds me, maybe one should check for is_unset() to disable the vendor-by-default behaviour?

spenserblack · 2023-09-04T21:56:27Z

How would .gitattributes get resolved for a submodule's files, by the way? I've never experimented with it, but I would assume that they wouldn't be able to "affect" a submodule's files. If you're modifying submodule /A, then the following .gitattributes in the root repo wouldn't really do anything, right?

# /.gitattributes
A/* eol=lf

Anyway, for now I'm really excited about this PR, so I'd be happy to merge this and continue the discussion on how to handle submodules later. So perhaps just an is_submodule flag on the stack for now?

spenserblack · 2023-09-05T01:47:00Z

OK, thinking about it a bit more, it might be nice for Vendored to have an add_vendored_dir method, or something similar. Any submodule we encounter could have its path registered to that. The code would be a bit cleaner and more testable that way IMO.
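
A rough sketch of what such a method could look like. This is hypothetical and not gengo's actual `Vendored` type: it assumes paths are slash-separated and relative to the repository root, and it ignores every other vendoring rule.

```rust
/// Hypothetical sketch of a registry of vendored directories.
#[derive(Default)]
struct Vendored {
    dirs: Vec<String>,
}

impl Vendored {
    /// Register a directory, e.g. the path of a submodule, as vendored.
    fn add_vendored_dir(&mut self, dir: &str) {
        self.dirs.push(dir.trim_end_matches('/').to_owned());
    }

    /// A path is vendored if it is a registered directory or lies inside one.
    fn is_vendored(&self, path: &str) -> bool {
        self.dirs.iter().any(|dir| {
            path.strip_prefix(dir.as_str())
                // Require a component boundary so that `zlib` does not match `zlib-ng`.
                .map_or(false, |rest| rest.is_empty() || rest.starts_with('/'))
        })
    }
}
```

Because the registry only deals in paths, it can be unit-tested without a git repository.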

Byron · 2023-09-05T05:26:11Z

How would .gitattributes get resolved for a submodule's files, by the way? I've never experimented with it, but I would assume that they wouldn't be able to "affect" a submodule's files. If you're modifying submodule /A, then the following .gitattributes in the root repo wouldn't really do anything, right?

Actually it would - it's perfectly viable to use complete paths of submodule files (as relative to the superproject root) for an attribute query in the superproject. But that's not what I meant - I merely meant to check the path of the submodule/worktree against the superproject to get attributes for configuration. With that, whatever default you chose could be overridden.

Anyway, for now I'm really excited about this PR, so I'd be happy to merge this and continue the discussion on how to handle submodules later. So perhaps just an is_submodule flag on the stack for now?

That would be enough to declare all submodule files vendored, I think.

OK, thinking about it a bit more, it might be nice for Vendored to have an add_vendored_dir method, or something similar. Any submodule we encounter could have its path registered to that. The code would be a bit cleaner and more testable that way IMO.

Maybe that doesn't matter (yet) for merging the PR, like you mentioned. However, I don't think submodules should always be vendored; there should be an escape hatch that just treats them like ordinary code that is part of the tree (like it's done now). Tying submodules too closely into vendor handling then seems counter-intuitive. But again, in the end it doesn't matter as long as the code does what it should, and you define what that is.

spenserblack · 2023-09-05T12:20:47Z

Going to merge as-is, we can handle submodules in a separate PR. I'll see if I can get something set up.

I agree that submodules shouldn't be always vendored, but I do think that should be the default, since "vendored" basically means "files from another project that are provided in this repo."
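
The default-plus-escape-hatch idea from this thread can be sketched as a tri-state check. The names below are hypothetical, and the enum only mirrors the set/unset/unspecified states of a git attribute; it is not the gix type, and other vendoring heuristics are left out.

```rust
/// Hypothetical mirror of the three states a boolean git attribute can be in.
#[derive(Clone, Copy, PartialEq, Debug)]
enum AttrState {
    Set,
    Unset,
    Unspecified,
}

/// Submodule files are vendored by default, but an explicit attribute always wins.
fn is_vendored(is_submodule: bool, vendored_attr: AttrState) -> bool {
    match vendored_attr {
        AttrState::Set => true,
        // The escape hatch: explicitly unsetting the attribute disables the default.
        AttrState::Unset => false,
        AttrState::Unspecified => is_submodule,
    }
}
```

With this shape, an unset attribute on a submodule path in the superproject would be enough to treat it as ordinary project code.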

spenserblack · 2023-09-05T12:23:22Z

@all-contributors add @Byron for code and userTesting

allcontributors · 2023-09-05T12:23:31Z

@spenserblack

I've put up a pull request to add @Byron! 🎉

Byron force-pushed the git2-to-gix branch from 3b7f8f8 to 60a664b Compare September 3, 2023 10:26

Byron added 3 commits September 3, 2023 15:18

Replace git2 with gix

25bcc01

Use the index directly instead of traversing the tree

ec7bc56

Run the git analyser enginer in parallel

26d48f7

Byron force-pushed the git2-to-gix branch from b89be10 to 97df5d3 Compare September 3, 2023 13:30

Byron marked this pull request as ready for review September 3, 2023 14:11

Byron requested a review from spenserblack as a code owner September 3, 2023 14:11

Byron changed the title ~~git2-to-gix~~ git2-to-gix - 60x faster on WebKit, 40x faster on Linux Sep 3, 2023

Byron added 2 commits September 3, 2023 17:12

Also traverse into submodules

bd5a7ec

Use attributes from revision only, for performance and precision

7cc48fe

Doing this avoids touching the disk.

Byron force-pushed the git2-to-gix branch from 56a5fcb to 7cc48fe Compare September 3, 2023 15:12

Byron mentioned this pull request Sep 3, 2023

Switch to gengo o2sh/onefetch#1152

Draft

2 tasks

spenserblack mentioned this pull request Sep 3, 2023

Use .gitattributes from rev? #31

Closed

spenserblack added a commit that referenced this pull request Sep 4, 2023

Add back gitattributes

b5caa95

Related: #157

Update snapshots

10794f0

See: b5caa95

spenserblack reviewed Sep 4, 2023

spenserblack and others added 4 commits September 4, 2023 12:37

Bump version

e968dd3

I intend to make a new release once spenserblack#157 is merged.

Accept windows snapshot to fix CI

873a4f1

Fix styling nitpicks

1b4a019

Add note

d2f545d

Refactor partitioning of submodules and entries

0b79b1a

spenserblack added 2 commits September 4, 2023 13:06

Add name of type back to Debug impl

8207896

Refactor how gix features are enabled/disabled

eca31d1

The `max-performance` and `max-performance-safe` gix features have been re-exported so that consumers can enable/disable them without importing a matching version of `gix`.

Unify OS snapshots

d92c7ba

spenserblack changed the title ~~git2-to-gix - 60x faster on WebKit, 40x faster on Linux~~ Vastly improve performance Sep 4, 2023

spenserblack merged commit 85e0605 into spenserblack:main Sep 5, 2023
9 of 11 checks passed

Byron deleted the git2-to-gix branch September 5, 2023 12:22

allcontributors bot mentioned this pull request Sep 5, 2023

📝 Add Byron as a contributor for code, and userTesting #162

Merged

spenserblack mentioned this pull request Sep 5, 2023

Handle submodules #163

Closed

spenserblack added a commit that referenced this pull request Sep 5, 2023

Drop git2

810ff9b

Thought this was done in #157.

spenserblack mentioned this pull request Sep 5, 2023

Mark submodules as vendored by default #164

Merged

spenserblack mentioned this pull request Sep 22, 2023

chore: Move git-related stuff to its own module #165

Closed
