Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Allow disabling git_status for large/slow repos #921

Closed
wants to merge 1 commit into from

Conversation

jaredwy
Copy link

@jaredwy jaredwy commented Feb 9, 2020

Description

Some repos are very large. This can take a while to run git status on them.

I am creating this as a draft repo as I had some questions and some remaining work but wanted to get some early feedback.

  1. What is the best way to fail reading the config file. E.g. If a path isn't absolute, we should skip it. Do we fail loading or do we warn and ignore.

  2. I added tests and they run but they seem to be ignored. Is there a reason for that? Should I also add mine to be ignored?

  3. I need to test this on windows.

Motivation and Context

As above, the repo I work with at work is rather large and using git status is prohibitively slow. I don't want to not have status on repos where it is still usable. This gives me the ability to turn it off just for the slow repos.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Screenshots (if appropriate):

How Has This Been Tested?

  • I have tested using MacOS
  • I have tested using Linux
  • I have tested using Windows

Checklist:

  • I have updated the documentation accordingly.
  • I have updated the tests accordingly.

@matchai
Copy link
Member

matchai commented Feb 9, 2020

  1. What is the best way to fail reading the config file. E.g. If a path isn't absolute, we should skip it. Do we fail loading or do we warn and ignore.

For the time being we log and ignore. We don't have an error mechanism in place yet.

  1. I added tests and they run but they seem to be ignored. Is there a reason for that? Should I also add mine to be ignored?

Ignored tests are tests that require the test environment to be a particular state (e.g. having a pre-installed dependency), or that modify the testing system in a non-temporary way (e.g. creating test files in ~)

In my opinion, I don't think that adding more configuration is a great solution for git status slowing starship down. I think a more elegant solution would be to have a sensible timeout when executing git status. What are your thoughts?

In a perfect world, we would experiment with using https://github.com/romkatv/gitstatus via FFI, but that may be a big undertaking.

@romkatv
Copy link

romkatv commented Feb 9, 2020

In a perfect world, we would experiment with using https://github.com/romkatv/gitstatus via FFI, but that may be a big undertaking.

Just wanted to mention that there is an alternative way to use gistatus in starship.

You can source the existing shell bindings for gitstatus when enabling starship. This gives you gitstatus_query function within the shell. This function, when called, sets a bunch of parameters that describe the sate of the current Git repository. For example, VCS_STATUS_LOCAL_BRANCH is the branch name, VCS_STATUS_NUM_UNTRACKED is the number of untracked files and so on.

Call gitstatus_query from the shell before every prompt. When gitstatus_query returns, call starship and pass it all VCS_STATUS_* parameters the same way you pass $?, $pipestatus and the like.

Just an idea. This isn't necessarily easy but might be worth trying. It can result is dramatic speedup: gitstatus is about 40 times faster than libgit2 that starship is using. You can estimate whether the speedup is worth the trouble by trying the native gitstatus prompt or powerlevel10k in the same directory where starship is slow.

@jaredwy
Copy link
Author

jaredwy commented Feb 9, 2020

@matchai even with gitstatus our return time is 3-4s (which is still an improvement from the ~15s) but still slow enough that I would want to disable it for that particular directory.

To put it in context, we have ~25 years of history and a LOT of files in the repo that I work on. Through various means you can get this down to 3-4 seconds, but that is still a lifetime waiting for the prompt to be active.

FYI posh_git has a similar feature for a similar reason https://github.com/dahlbyk/posh-git#customization-variables

Personally I would love this feature, but understand if you don't want to take it on :) I am somewhat opposed to the timeout idea as I know this directory is going to cause problems, it will always timeout, so shouldn't startship just avoid spawning the extra process waiting for a period of time before I can enter a command? I want my shell to be snappy, not sitting around for 1s waiting :)

@matchai
Copy link
Member

matchai commented Feb 10, 2020

so shouldn't starship just avoid spawning the extra process waiting for a period of time before I can enter a command?

Modules are computed on parallel threads, so the duration of the wait, in most cases, shouldn't usually affect how long it takes for the prompt to render. I was thinking that a reasonable timeout would be 100ms, which is generally considered to be the upper-bound of what feels immediate, but lowering it is also an option.

@jaredwy
Copy link
Author

jaredwy commented Feb 10, 2020

I get a massive delay (~15s) rendering are the moment. Wouldn't threads all have to join before rendering?

I don’t think 100ms would be enough for the three main repos I work on, even with gitstatus above. I can get some actual times for that though.

@romkatv
Copy link

romkatv commented Feb 10, 2020

@jaredwy Let's try to figure out why fetching git status is taking so much time.

even with gitstatus our return time is 3-4s

How did you measure this?

To put it in context, we have ~25 years of history and a LOT of files in the repo that I work on.

What's the output of git ls-files | wc -l?

Are you benchmarking in a clean repository?

How long does git status take?

Do timings improve after git gc?

What OS are you using?

On which storage device is the local repository stored? HDD, SATA SSD, NVME SSD?

On which file system is the local repository stored?

@jaredwy
Copy link
Author

jaredwy commented Feb 10, 2020

It’s slow because it’s a huge project, with git status in a clean repo taking a while.:) it’s a known problem with git for very large repos there are mitigation’s for it but come with trade offs (e.g. it won’t track new files etc).

@romkatv
Copy link

romkatv commented Feb 10, 2020

It’s slow because it’s a huge project :)

If it's not too much to ask, could you answer at least some of my questions?

On my machine for gitstatus_query to take 3 seconds the repository must have 30 million files. Does git ls-files | wc -l print a number close to this?

Edit: Fixed a typo: 30 million files rather than 3 million.

@jaredwy
Copy link
Author

jaredwy commented Feb 10, 2020

I’m not at my computer, but I doubt we are at 30 million but it’s going to be over a million.

Os is windows and Mac. Ssd and nvme ssd
Timings don’t improve after a gc.
Git status takes 12-15 seconds.

@romkatv
Copy link

romkatv commented Feb 10, 2020

Thanks!

If plain git status takes 12-15 seconds on macOS or Windows, then 3-4 seconds for gitstatus_query is normal. On Linux the expected difference is about 10x but on other operating systems it can be as low as 4x (I primarily optimize for Linux as this is what I use myself).

Granted, even if gitstatus_query was 100 faster than git status, you still wouldn't be able to use prompt if gitstatus_query was called synchronously. 150ms prompt latency is too much.

I wonder if fetching git status asynchronously would give you good UX. If you have a free moment, could you try https://github.com/romkatv/powerlevel10k to see if it's usable in your huge Git repository out of the box? This requires Zsh. If you aren't using Zsh but can spare a few minutes to test this, here's how:

  1. Install Zsh.
  2. Run these commands:
git clone --depth=1 https://github.com/romkatv/powerlevel10k.git ~/powerlevel10k
echo 'source ~/powerlevel10k/powerlevel10k.zsh-theme' >> ~/.zshrc
  1. Run zsh.

If powerlevel10k is unusable out of the box in the Git repository of yours, type POWERLEVEL9K_VCS_MAX_INDEX_SIZE_DIRTY=100000; p10k reload while in zsh. See below for the meaning of this parameter.

As for ignoring large repositories when computing Git status, I can share how this is done in gitstatus and powerlevel10k. Perhaps it'll help.

  • gitstatus respects status.showUntrackedFiles, bash.showUntrackedFiles and bash.showDirtyState in Git configs by default. The first is a standard option that git status respects. The other two are from https://github.com/git/git/blob/master/contrib/completion/git-prompt.sh. This allows users to mark "slow repos" by modifying their local Git config.
  • gitstatus has a --dirty-max-index-size=N flag that disables workdir and index scan for Git repositories that have more than N files in the index. You still get things like branch name, number of commits ahead/behind, etc., -- all these things are very fast to fetch. POWERLEVEL9K_VCS_MAX_INDEX_SIZE_DIRTY=100000 that I mentioned above gets translated into --dirty-max-index-size=1000000. This gives users a convenient way to disable Git status in prompt for all slow repositories with a single parameter.
  • Powerlevel10k has parameter POWERLEVEL9K_VCS_DISABLED_WORKDIR_PATTERN. It allows you to ignore Git repositories whose workdir matches this pattern. Users who have .git in their home directory (for managing dotfiles) often set POWERLEVEL9K_VCS_DISABLED_WORKDIR_PATTERN='~' so that they don't see git status in prompt when in a random subdirectory of their home directory. Note that disabling of git status is done based on the workdir of the repo rather than on PWD. Using PWD for this check wouldn't allow for the use-case I mentioned above: users don't want to see git status in prompt when in ~/Downloads but do want it in ~/some-git-repo. Multiple patterns can be combined with |.

@jaredwy
Copy link
Author

jaredwy commented Feb 13, 2020

Is it worth me refining this pull request to match the last point in @romkatv comment above. As mentioned above even switching to use gitstatus would result in a pretty unsuable repo.

I don't really agree with the timeout solution as it results in extra unneeded work. I know the slow repos, so I would rather not have to wait the 100ms(or whatever value) for something I know is going to fail.
This would also require a config change to allow for larger repos, I can think of more than a handful than a few repos on my disk that would not return in under 100ms. Although some would be close, and I could live with it, another config option :)

As it stands starship is unusable for me, I can't wait 15 seconds on my main work repo for the prompt to become usable again, but I would rather not disable it totally fort the smaller repos, and there is certainly precedent for it in other systems powerlevel10k and posh-git for example.

I would love to be able to continue to use so happy to work on a solution that makes everyone happy.

@romkatv
Copy link

romkatv commented Feb 14, 2020

Perhaps I should clarify the intention of my previous comment.

There are many potential solutions for the problem of supporting large Git repositories in Starship. Knowing which ones will and won't work in different circumstances is valuable. I've assembled the data you've already provided in this gist. Even thought it has little specifics, it provides valuable insights that could inform the development of Starship. Specifically, it says that in one specific case gitstatus_query is 4 times faster than git status but it's still too slow to be called synchronously before every prompt. Now it's clear that simply replacing libgit2 in Starship with gitstatus won't solve all Git-related performance problems.

My last comment is asking you to check other potential solutions to see if they would work if implemented in Starship.

  • If Starship were to call gitstatus_query asynchronously, would it work for you?
  • If Starship were to skip workdir scan for Git repositories with over 100000 files, would it work for you?
  • If Starship were to honor bash.showDirtyState, would it work for you?

Answering these questions by imagining the solutions is difficult and error prone. Implementing them in Starship before knowing whether they'll work is expensive. A more efficient approach is to try them in powerlevel10k where they are already implemented. This would allow Starship to build on experience from other projects -- adopt things that work well, and skip those that don't.

@jaredwy
Copy link
Author

jaredwy commented Feb 14, 2020

If Starship were to call gitstatus_query asynchronously, would it work for you?

It would work, if prompts allow for async updates. Other wise we will just block waiting for it to complete.
I know zsh supports async prompt updates, but does bash/powershell etc? Im not too sure. Reading over this
https://docs.microsoft.com/en-us/powershell/module/microsoft.powershell.core/about/about_prompts?view=powershell-7 I can't see anything that would suggest it does,.

If Starship were to skip workdir scan for Git repositories with over 100000 files, would it work for you?
This could also work, but it seems rather non deterministic to the user. i.e. they add that 100000 and suddenly their prompt changes. Not a great user experience.

If Starship were to honor bash.showDirtyState, would it work for you?

I don't see how that relates to other shells?

@romkatv
Copy link

romkatv commented Feb 14, 2020

I know zsh supports async prompt updates, but does bash/powershell etc? Im not too sure.

Fair enough. You have no obligating to help users of other shells. I respect that.

If Starship were to skip workdir scan for Git repositories with over 100000 files, would it work for you?

This could also work, but it seems rather non deterministic to the user. i.e. they add that 100000 and suddenly their prompt changes. Not a great user experience.

Your PR adds a configuration option to Starship. This is a different configuration option that could be added to Starship. The question is which one is better.

Specifying the upper limit on repository size has the following advantages:

  • The user specifies just one number in their config which defines what it means for a Git repository to be "large". They don't have to list all large repositories one by one.
  • The user still gets most parts of Git status even in large repositories: branch, ahead/behind, staged changes, etc.
  • The user gets an indication in prompt when the repository is large so that they known there may be unstaged changes or untracked files that aren't shown.

If Starship were to honor bash.showDirtyState, would it work for you?

I don't see how that relates to other shells?

Starship can honor this configuration option, or it could define starship.showDirtyState with identical meaning. Would it work for you?

I'm mentioning it because this is another solution from a very popular project for the same problem you are facing.

P.S.

If you don't find this conversation productive, please tell me. I'll withdraw, no hard feelings.

@MortezaRamezani
Copy link

MortezaRamezani commented Feb 16, 2020

This is one of the feature I am looking for. In the meantime I am using something similar to what default git ps1 does:

https://github.com/git/git/blob/master/contrib/completion/git-prompt.sh#L510

I have modified starship, where you can set the bash.shoDirtyState true or false for each repo or entirely by setting "GIT_PS1_SHOWDIRTYSTATE".

+++ b/src/modules/git_status.rs
@@ -190,6 +190,16 @@ fn create_segment_with_count<'a>(
 fn get_repo_status(repository: &Repository) -> Result<RepoStatus, git2::Error> {
     let mut status_options = git2::StatusOptions::new();

+    match repository.config()?.get_bool("bash.showDirtyState") {
+        Ok(entry) => {
+          if !entry {
+            return Err(git2::Error::from_str("Repo has no status"));
+          }
+        }
+
+        _ => {}
+    }

@jaredwy
Copy link
Author

jaredwy commented Feb 16, 2020

I know zsh supports async prompt updates, but does bash/powershell etc? Im not too sure.

Fair enough. You have no obligating to help users of other shells. I respect that.

Well I use powershell as well, so need it to work there at least :)

If Starship were to skip workdir scan for Git repositories with over 100000 files, would it work for you?

This could also work, but it seems rather non deterministic to the user. i.e. they add that 100000 and suddenly their prompt changes. Not a great user experience.

Your PR adds a configuration option to Starship. This is a different configuration option that could be added to Starship. The question is which one is better.

Specifying the upper limit on repository size has the following advantages:

  • The user specifies just one number in their config which defines what it means for a Git repository to be "large". They don't have to list all large repositories one by one.
  • The user still gets most parts of Git status even in large repositories: branch, ahead/behind, staged changes, etc.
  • The user gets an indication in prompt when the repository is large so that they known there may be unstaged changes or untracked files that aren't shown.

For me it comes with a disadvantage of having to now go to that repo, perform ls-files get the count and update. With my solution you just add the directory you know is slow.

If Starship were to honor bash.showDirtyState, would it work for you?

I don't see how that relates to other shells?

Starship can honor this configuration option, or it could define starship.showDirtyState with identical meaning. Would it work for you?

I'm mentioning it because this is another solution from a very popular project for the same problem you are facing.

The issue that I have with this approach is now you have split the config. I now have to set some options in starship, others in my git config. I would favour a solution that allows me to keep all my config in the one spot. This would aid in portability for me. E.g. I put all my source in the same directory across every machine. I know that repo A will be slow. So even if I setup a new machine, i just have to pull down my startship config and i can clone the repo and have it work. Where with this solution I would need to set a gitconfig after a clone.

P.S.

If you don't find this conversation productive, please tell me. I'll withdraw, no hard feelings.

@romkatv
Copy link

romkatv commented Feb 16, 2020

For me it comes with a disadvantage of having to now go to that repo, perform ls-files get the count and update. With my solution you just add the directory you know is slow.

Hm... Don't you now have to go to that repo, perform pwd to get the current directory and update? The difference is that you have to do this for every repo, while with the repo-size-threshold parameter you specify the parameter just once. The value of the parameter defines your tolerance for prompt latency. Another huge advantage is that you get Git status in all repositories.

I think we both understand each other's points. I believe that options that are used by other successful projects work very well. You believe that your PR works very well. Let's leave it at that.

@matchai
Copy link
Member

matchai commented Mar 12, 2020

Given how much this has become a need as of late, I'd be open to seeing this solution through. 😊
@jaredwy Are you still interested in working on the refinements mentioned in this thread?

@jaredwy
Copy link
Author

jaredwy commented Mar 12, 2020

@matchai happy to get this across the line.

Which refinments would you like to see particularly and happy to do the work. Been a lot of discussions, so just not sure which direction you would like to see it go.

@matchai
Copy link
Member

matchai commented Mar 12, 2020

I'd be inclined to experiment with the solution proposed by @romkatv, using ls-files to have a sensible cut-off number for repos that are likely to be slow. It seems like the best of both worlds – configurable but useful to those who don't configure it.

@jaredwy
Copy link
Author

jaredwy commented Mar 12, 2020

Happy to wire that up.

just want to register my concern: This feels like it could be a bit surprising to users.

Imagine, starship is working fine, you do a fetch and suddenly part of your prompt disappears. It isn't immediately obvious why without reading docs why you lost status.

@matchai
Copy link
Member

matchai commented Mar 12, 2020

It isn't immediately obvious why without reading docs why you lost status.

Yeah, that's not so great. 😕
Unfortunately, it seems it's between that, and waiting seconds (and sometimes even minutes) for the prompt to appear. Unless we can implement a timeout mechanism in addition, I'd be worried that we are rendering the prompt unresponsive.

It seems to me like the lesser of two evils, but I could be convinced otherwise.

@romkatv
Copy link

romkatv commented Mar 12, 2020

just want to register my concern: This feels like it could be a bit surprising to users.

Imagine, starship is working fine, you do a fetch and suddenly part of your prompt disappears. It isn't immediately obvious why without reading docs why you lost status.

Not showing Git status when the size of a repository is above some threshold would indeed be surprising and unhelpful. I would recommend against this approach.

In Powerlevel10k the following algorithm has worked great:

  1. Always show things whose computation time doesn't depend on index size. These include branch name, the number of commits ahead/behind the remote, etc.
  2. Do not compute things whose computation time depends on index size if the number of files in the index is above a configurable threshold with some fairly high default.
  3. When (2) kicks in, show some indicator in prompt to let users know that the repository is large and that it may or may not be dirty. The icon displayed in this case can be configurable.

@chipbuster
Copy link
Contributor

The contents of romkatv's gist from this comment, in case the gist ever goes offline later.

  • Source: feat: Allow disabling git_status for large/slow repos #921.
  • Timing for git status: between 12 and 15 seconds.
    • Method of timing: likely time git status.
    • Reported time: probably wall time.
    • Cold or hot: unknown.
  • Timing for gitstatus_query: between 3 and 4 seconds.
    • Method of timing: unknown.
    • Reported time: probably wall time.
    • Cold or hot: unknown.
    • gitstatusd command-line arguments: unknown.
  • Repository state: clean, soon after git gc, possibly with ignored files.
  • Output of git ls-files --others --ignored --exclude-standard | wc -l: unknown.
  • Output of git ls-files | wc -l: likely over 1M and probably less than 30M.
  • Output of git tag | wc -l: unknown.
  • Timing for git describe --tags --exact-match: unknown.
  • OS: macOS or Windows (unknown whether WSL, WSL2, Cygwin or MSYS2).
  • Storage device: SSD (unknown whether SATA or NVME).
  • File system: unknown.
  • CPU: unknown.

@chipbuster
Copy link
Contributor

Since this PR was created we've had a lot of focus put into git performance: we've moved from using libgit2 to parsing git and using gitoxide, and we've introduced command timeouts for slow repositories.

While it's possible the approach here might still be necessary in the end, I'm going to go ahead and close this for now as a way to reduce noise while we focus on improving our gitoxide perf.

The discussions about interface and what's expected on this PR have been invaluable---thank you to everyone who contributed!

@chipbuster chipbuster closed this Oct 24, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants