git submodules are not cached #7987

Open
ehuss opened this issue Mar 11, 2020 · 5 comments
Labels
A-caching Area: caching of dependencies, repositories, and build artifacts A-git Area: anything dealing with git A-networking Area: networking issues, curl, etc. C-bug Category: bug S-needs-mentor Status: Issue or feature is accepted, but needs a team member to commit to helping and reviewing.

Comments

@ehuss
Contributor

ehuss commented Mar 11, 2020

Problem
If a package has a git dependency with a large submodule, any change to the git repo that updates the submodule causes the entire submodule repo to be re-downloaded from scratch, and an entire separate copy is retained. This can be very expensive for both network download time and disk space.

Steps

  1. In a blank project add dependency: rocksdb = {git = "https://github.com/tikv/rust-rocksdb.git", rev="fe7be35ba191684c989effdc6ee8e39a3978e650"}
  2. cargo fetch
  3. Change rev to 3cd18c44d160a3cdba586d6502d51b7cc67efc59
  4. cargo fetch
  5. Notice it downloaded the entirety of the submodule https://github.com/tikv/rocksdb.git which is about 100MB.
  6. Change rev to 5adf5b847e13cea2a59a1b4921aa5bf38591d1a3
  7. cargo fetch
  8. Notice it downloaded yet another copy.
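
To see the duplication from steps 5 and 8 on disk, one option is to total the bytes under each directory in Cargo's git checkout area before and after each `cargo fetch`. The following is a minimal sketch using only the Rust standard library. The helper names `dir_size` and `sizes_by_child` are made up for illustration, and the directory to inspect (typically somewhere under `git/checkouts` in the Cargo home) is an assumption about the local setup.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Recursively sums the sizes of all regular files under `root`.
/// Symlinks are not followed, so a linked tree is not counted twice.
fn dir_size(root: &Path) -> io::Result<u64> {
    let mut total = 0;
    for entry in fs::read_dir(root)? {
        let entry = entry?;
        // DirEntry::metadata does not traverse symlinks.
        let meta = entry.metadata()?;
        if meta.is_dir() {
            total += dir_size(&entry.path())?;
        } else if meta.is_file() {
            total += meta.len();
        }
    }
    Ok(total)
}

/// Returns (directory name, size in bytes) for each immediate
/// subdirectory of `parent`, sorted by name.
fn sizes_by_child(parent: &Path) -> io::Result<Vec<(String, u64)>> {
    let mut out = Vec::new();
    for entry in fs::read_dir(parent)? {
        let entry = entry?;
        if entry.file_type()?.is_dir() {
            let name = entry.file_name().to_string_lossy().into_owned();
            out.push((name, dir_size(&entry.path())?));
        }
    }
    out.sort();
    Ok(out)
}
```

Calling `sizes_by_child` on the checkout directory for the dependency after each step should show one additional entry per rev, each including its own copy of the submodule, if the behavior described here holds.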

Possible Solution(s)
The repo in git/db/… should probably contain the submodule. Currently, cargo appears to check out a fresh copy of the submodule for every commit in git/checkouts/…. I think this is because cargo uses Submodule::open. I wonder if using Submodule::update would be the solution?

@ehuss ehuss added C-bug Category: bug A-git Area: anything dealing with git A-caching Area: caching of dependencies, repositories, and build artifacts A-networking Area: networking issues, curl, etc. labels Mar 11, 2020
@ehuss
Contributor Author

ehuss commented Mar 12, 2020

Hm, after looking into this more closely, I think I understand it better. Bare repos cannot include submodules. I don't understand why (it seems like it could just have a modules/… directory like a non-bare repo does).

I wonder if worktrees might be an option. The git-cli docs tell you NOT to do that (support is "incomplete"), so it may be either impossible or too risky.

Are there any other ideas on how to share submodules across multiple checkouts?

@alexcrichton
Member

I think that we may need to take over management of submodules away from git. Ideally they'd use the same caching/etc mechanism as main git repos, meaning we'd have entries in the database for submodules too. We'd then manually check out modules from the database into repositories, or do something like a "git clone using hardlinks from that other path", similar to how we check out git repos from the db today, which is in theory very fast.

I think we probably rely too much on native git submodule management here right now, which hinders the caching? I'd be hesitant to dip our toes too much into fancy features like worktrees personally.
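
As a rough illustration of the idea above (one shared database entry per submodule URL, then a hardlinked local clone into each checkout), here is a minimal sketch using only the Rust standard library. Everything in it is an assumption for illustration: the function names, the `<name>-<hash>` directory layout, and the choice of FNV-1a as the hash are hypothetical and are not how Cargo keys its git database. The only external behavior relied on is `git clone --local`, which hard-links object files when cloning from a local path where the filesystem allows it.

```rust
use std::path::{Path, PathBuf};
use std::process::Command;

/// 64-bit FNV-1a hash. Chosen only because it is tiny and gives the same
/// result on every run; this is NOT the hashing scheme Cargo uses.
fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325;
    for &b in bytes {
        hash ^= u64::from(b);
        hash = hash.wrapping_mul(0x100000001b3);
    }
    hash
}

/// Hypothetical location of the shared clone for a submodule URL:
/// `<db_root>/<last path segment>-<hash of the full URL>`.
fn submodule_db_path(db_root: &Path, url: &str) -> PathBuf {
    let trimmed = url.trim_end_matches('/');
    let trimmed = trimmed.strip_suffix(".git").unwrap_or(trimmed);
    let name = trimmed.rsplit('/').next().unwrap_or("submodule");
    db_root.join(format!("{}-{:016x}", name, fnv1a64(url.as_bytes())))
}

/// Builds (without running) `git clone --local <db_repo> <dest>`.
/// The caller would still need to check out the commit the parent
/// repository pins for this submodule.
fn hardlink_checkout_command(db_repo: &Path, dest: &Path) -> Command {
    let mut cmd = Command::new("git");
    cmd.arg("clone").arg("--local").arg(db_repo).arg(dest);
    cmd
}
```

With a layout like this, two revisions of a parent repository that pin different commits of the same submodule would map to the same database entry, so only the missing objects would need to be fetched once.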

@sjackman

TIL about git subtree via @nlhepler, an alternative to git submodule, which should work out of the box with cargo.
See https://www.atlassian.com/git/tutorials/git-subtree

@expenses

expenses commented Jan 1, 2022

I'm currently struggling with possibly the worst case of this issue. The gltf crate repo has the glTF-Sample-Models repo as a submodule. Across different projects I have 4 different git dependencies of this crate in use, which means that I have 4 different checkouts of the repo:

[Screenshot: WinDirStat view of the cargo directory, 2022-01-01]

Any fixes to this issue would be greatly appreciated.

@jrose-signal
Contributor

jrose-signal commented Aug 1, 2022

Having some kind of setting or environment variable to request a shallow clone might be a good stopgap solution. Not all repository hosts support shallow clones, but enough do for this to be useful (say, in CI).

EDIT: I see #1171 talks about this for both top-level repositories and their submodules.
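
For reference, the stopgap described here corresponds roughly to what the plain git CLI offers through `--depth` and `--shallow-submodules`. The sketch below only builds such a command with the Rust standard library; it does not reflect any existing Cargo setting. The function name `clone_command` and the `shallow` flag are hypothetical, and whether a given host honors shallow requests is outside what this code can check.

```rust
use std::path::Path;
use std::process::Command;

/// Builds (without running) a `git clone` that also clones submodules.
/// When `shallow` is true, the top-level repository is limited to depth 1
/// and `--shallow-submodules` asks git to clone submodules at depth 1 too.
fn clone_command(url: &str, dest: &Path, shallow: bool) -> Command {
    let mut cmd = Command::new("git");
    cmd.arg("clone").arg("--recurse-submodules");
    if shallow {
        cmd.args(["--depth", "1", "--shallow-submodules"]);
    }
    cmd.arg(url).arg(dest);
    cmd
}
```

A caller could derive `shallow` from a setting or environment variable of its own choosing, for example only in CI, and fall back to a full clone when the host rejects the shallow request.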

@epage epage added the S-triage Status: This issue is waiting on initial triage. label Nov 3, 2023
@ehuss ehuss added S-needs-mentor Status: Issue or feature is accepted, but needs a team member to commit to helping and reviewing. and removed S-triage Status: This issue is waiting on initial triage. labels Nov 22, 2023
jschwe added a commit to jschwe/fontsan that referenced this issue Feb 4, 2024
Update `ots` to the latest release and vendor ots and all dependencies.
Cargo build will recursively clone submodules for `git` dependencies.
`ots` itself and some of its dependencies have quite large git repos,
e.g. `ots` has 80MB of test font files. Vendoring the sources reduces
the required network bandwidth and disk space usage greatly.
See also rust-lang/cargo#7987 for more details
on the effects on disk usage (not super relevant for us, since we rarely
update our dependencies).

Ideally each of the C dependencies would have their own crate, but that
can be done later.
6 participants