Investigate caching .git data on Travis/AppVeyor #40772
Comments
alexcrichton added the A-infrastructure and E-help-wanted labels Mar 23, 2017
On the Travis side, it should just be a matter of enumerating all of the directories you want: https://docs.travis-ci.com/user/caching/#Arbitrary-directories |
Yeah unfortunately I don't actually know what directories are cached here. Is it just If it's just |
I've made a stab in #40780. It's not as simple as caching the .git directory - by the time the script runs, you already have a branch checked out and it's not really clear how you'd move the objects around. It's also not as simple as just copying Instead, I've tried to go the route of using the As an aside, I'm a bit suspicious that
Seems like it has to fetch the whole thing anyway in order to get my branch? Maybe worth raising an issue on the travis issue tracker to ask if this is expected? |
FWIW, CircleCI is a bit vague about how their git caching works: https://circleci.com/docs/1.0/how-cache-works/#git-cache
This opacity combined with complaints about the caching ([1], [2]) is not reassuring. I could hazard some guesses at how they're doing it, but it's likely going to be more complicated and therefore less reliable than an approach tailored to the rust repo. |
We used to do this on buildbot, didn't we? IIRC a corrupted .git was a pretty common occurrence.
I assume you're referring to #34595, which looks like it happens because the cache may contain a partially cloned/corrupt git submodule. This was a problem even without caching (a network failure followed by a retry is unhappy because of this same corrupted state) and was seemingly resolved by running deinit before update (#39055). The same fix may have fixed the buildbots, had they still been around. The second problem with the buildbots, aside from not doing deinit, was caching continuously rather than only when the cache was valid. I can think of two approaches for validating before caching off the top of my head, so if we see issues (or you want a more paranoid approach to begin with) I'll be able to put something in place fairly quickly.
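The comment above does not say which two validation approaches were meant. As one possible illustration of "only cache when the cache is valid", here is a minimal Python sketch that gates a cache-save step on `git fsck` succeeding. The function name `git_dir_is_healthy` and the idea of using `git fsck` as the check are assumptions for this example, not the approach taken in the linked PRs.

```python
import subprocess


def git_dir_is_healthy(git_dir):
    """Return True only if `git fsck` exits cleanly for the given git directory.

    Hypothetical helper: the thread does not specify how validation was done.
    """
    try:
        result = subprocess.run(
            # --git-dir pins the check to this directory instead of
            # letting git discover some enclosing repository.
            ["git", "--git-dir=" + git_dir, "fsck", "--no-dangling"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except FileNotFoundError:
        # git is not installed, so nothing can be verified: do not cache.
        return False
    return result.returncode == 0
```

A CI script could call this just before saving the cache and skip the save when it returns False, so a partially cloned or corrupt repository never replaces a good cache entry.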
Thanks for the investigation @aidanhs! I'll take a look at #40780 soon. Also yeah @nagisa I'd want to be careful about a solution here. Bad caching can cause unending problems, so I'd want to make sure we're always in a situation where the tool we're caching is relatively robust to odd cache entries. For example |
Call me stupid (since I probably am missing something obvious), but why do Travis CI and Circle CI workers even download the commit history? GitHub has the "download a tarball from https://github.com/rust-lang/rust/archive/master.tar.gz" option, which is probably way faster. |
@notriddle if you already have a repo of N commits, downloading the N+1th commit with That said, Travis definitely has something sub-optimal with cloning - it should be doing a shallow clone of the branch, but instead it does a shallow clone of master and then a full clone of PR branches (defeating the point of the shallow clone). |
frewsxcv added a commit to frewsxcv/rust that referenced this issue Mar 29, 2017
After a bit of a shaky start (a merge completed successfully, then subsequent merges failed because appveyor caching was broken), just the appveyor part was rolled back. Only travis builds are currently using the new repo caching, and it seems to be working ok. In the middle of trying to fix appveyor last night, I realised that there is a way for me to test appveyor: fork rust, then comment out all CI that actually does any rust compilation etc. That way I'll be able to test just the caching. I'll be back soon with a tested PR for appveyor.
Thanks for the continuing investigation @aidanhs!
alexcrichton closed this Mar 30, 2017
alexcrichton reopened this Mar 30, 2017
(oops didn't mean to close) |
Unfortunately, I was completely unable to reproduce the cache restore failure on appveyor, despite trying a number of times. However, after reviewing the logs again, I don't actually think the cache restore failure was the issue, I think it was a different buggy part of the appveyor.yml. There's a new PR to re-enable appveyor at #41075. Cache aside, as part of this issue I do think it's worth looking into this sequence of commands on travis:
I described this above:
The extra time spent here makes the cache a little less effective (40s wasted), and consumes about 60-75% of the time on the no-op PR builds - that's ~ 35x45s across Linux no-op builds and ~ 5x190s across OSX no-op builds, totalling 40-45mins of dead time per PR push! Seems massively wasteful, even if it is in parallel. I've stumbled across something that looks like the ideal solution - appveyor lets you implement a custom |
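The dead-time figure quoted above can be sanity-checked with quick arithmetic. This is a minimal Python sketch that takes the approximate job counts and per-job durations straight from the comment; the function name is made up for illustration.

```python
def dead_time_seconds(job_groups):
    """Sum count * seconds over (count, seconds_per_job) pairs."""
    return sum(count * seconds for count, seconds in job_groups)


# Approximate figures from the comment: ~35 Linux no-op builds at ~45s each
# and ~5 OSX no-op builds at ~190s each.
linux_jobs = (35, 45)
osx_jobs = (5, 190)

total_seconds = dead_time_seconds([linux_jobs, osx_jobs])
print(total_seconds)                  # 2525
print(round(total_seconds / 60, 1))   # 42.1
```

The result, about 42 minutes, falls inside the "40-45mins of dead time per PR push" range given in the comment. It is time summed across parallel jobs, not wall-clock time.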
Ok, the inefficiency in travis has been spotted before - travis-ci/travis-ci#6183, travis-ci/travis-build#747. I guess it just needs resurrecting and fixing. Appveyor seems to do it like so:
@aidanhs thanks for the investigation! Should we file an upstream travis bug for that? |
@alexcrichton nah I'll just make a PR (resurrecting the one I linked above) at some point in the next week or so. Well, you can raise an issue if you want one for tracking purposes :) |
Sounds good to me, thanks! |
Some updates:
Some updates since I've paused work on this (partially to rethink, see 1b):
aidanhs referenced this issue May 7, 2017: Tracking issue for spurious network failures on bots #40474 (Open)
kennytm referenced this issue May 10, 2017: Add an in-place rotate method for slices to libcore #41670 (Merged)
Raised http://help.appveyor.com/discussions/problems/6735-corrupt-caches about the corrupt caches. We've had to implement @notriddle's suggestion to 'temporarily' use .tar.gz files for the llvm submodule in #42211, since cloning it was causing builds to time out when combined with the current appveyor network issues. I'm not delighted, but while caches don't work there aren't really any other great options I'm aware of.
Mark-Simulacrum added T-infra and removed A-infrastructure labels Jun 25, 2017
Appveyor have replied to the issue effectively acknowledging the problem and saying it'll be fixed with cache "v2". |
Mark-Simulacrum added the C-tracking-issue label Jul 27, 2017
Triage: not aware of any movement here |
alexcrichton commented Mar 23, 2017
The rust-lang/rust repo itself takes a while to clone, but we somewhat mitigate that with --depth=1 clones. Our submodules, however, are much larger and unfortunately cannot be cloned with a --depth argument due to how the branches work. This typically means that cloning the LLVM repo takes quite a long time! Unfortunately this also increases our exposure to network problems by requiring a lot of data to move over the network.
When playing around with CircleCI recently I found that they automatically cached git repository data, which greatly sped up cloning the repository and checking out submodules. Overall it felt quite nifty! We should investigate to see if a similar strategy can apply to Travis and/or AppVeyor. I'm not personally familiar with how CircleCI's git caching works, so some investigation there would be needed (and comments if you're familiar with it would be most welcome!)
Overall I would expect this change to:
Any help to implement this would be very much appreciated!