Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Build cache should be reusable between git checkouts #673

Closed
davewo opened this issue Mar 20, 2020 · 12 comments
Closed

Build cache should be reusable between git checkouts #673

davewo opened this issue Mar 20, 2020 · 12 comments

Comments

@davewo
Copy link

davewo commented Mar 20, 2020

Steps to reproduce:

  • checkout a project built using shadow-cljs
  • build a release (verbose output enabled to see cache misses)
  • build a release again (observe a faster build, because cache is warm)
  • update the modified time of a file included in the build (do not change its content)
  • build the release again
  • observe a cache miss for the file
@davewo
Copy link
Author

davewo commented Mar 20, 2020

Happy to help with a PR on this :-)

@superstructor
Copy link
Contributor

Another option to workaround this is to use a git hook like Danny Lin's git-store-meta to save and restore relevant metadata such as mtime.

Or there are simpler options just for mtime, like

rev=HEAD
for f in $(git ls-tree -r -t --full-name --name-only "$rev") ; do
    touch -d $(git log --pretty=format:%cI -1 "$rev" -- "$f") "$f";
done

The downsides of those are that it can be slow, and its extra complexity for users to get caching between checkouts working.

The question then is should this be something shadow-cljs fixes or something users fix with their own solutions like the above ? For example shadow-cljs might need to do hashing of all the files to work out what has modified instead of using mtime which might have performance implications for shadow-cljs itself ?

@thheller
Copy link
Owner

I think the best way to handle this would be to md5/sha1 actual files only, so it keeps the old logic for .jar files. I'm not sure how much of a perf impact this may actually have so I might keep it behind a flag.

@darwin
Copy link
Sponsor Contributor

darwin commented Mar 23, 2020

jar/zip files should contain crc32 checksums of all files in zip header. So potentially it could be pretty fast to check that. Just my $.02

@thheller
Copy link
Owner

Interesting. I don't know much about crc32 but does the same content always produce the same checksum? So if you have the same file but in thing-0.1.jar and thing-0.2.jar do they have the same checksum? Doesn't really matter, just curious.

@darwin
Copy link
Sponsor Contributor

darwin commented Mar 24, 2020

Yes, AFAIK. The only problem could be collisions since the checksum is only 32bits. Meaning that you have two files with different content but giving you the same CRC and you decide to use cached version. But I think the likelihood is very slim because we are dealing here with text files and nobody is going to attack this system with collisions. Also you would be comparing multiple CRCs (of a list of files), so under normal circumstances the change is even lower when touching multiple files.

I would go through list of files in the zip header and hash all filenames and CRC fileds into one md5/sha1 and use that as content hash (maybe also include other userful fields from the header like file size?). Filenames should be included because you want to cover cases when people rename files without changing their content, I guess.

@thheller
Copy link
Owner

It should be fine as long as the CRC is unique enough that it doesn't clash with similar content. It is never used as a lookup value and only used in combination with the same file. So the only problem would be if two "similar" files result in the same CRC. Dunno how likely it is to get the same CRC with small changes like changing foo to bar.

I can just md5/sha1 the whole jar file as a whole too. It is fine to invalidate everything in that file if that hash changes, no need to test individual files.

@thheller
Copy link
Owner

I released 2.8.103 with ebd27e4 which now should use the sha1 hash of the source file for caching purposes.

I only tested this briefly though so would be nice if someone could confirm that actually does what its supposed to in actual CI environments.

@superstructor
Copy link
Contributor

superstructor commented Apr 27, 2020

I've tested this and it works with GitHub Actions for two internal (private repo) apps. Reduced compile times from ~46s to ~8s which is great 👍 😄 Thanks @thheller !

However I cannot get it to work for re-frame-10x todomvc example which has a similar configuration so still working out what the difference is with that.

@thheller
Copy link
Owner

@superstructor if I read this right this uses the git sha as the cache key? Doesn't that invalidate the cache for each commit?

https://github.com/day8/re-frame-10x/blob/e7ff832a3184a12d2344582daa653c5bb6c0b144/.github/workflows/continuous-integration-workflow.yml#L39

@superstructor
Copy link
Contributor

Partially correct but a key point is the addition of a restore-keys pattern in combination with the cache key.

Using a git sha as the cache key ensures that the latest cache is always uploaded with GitHub Actions. One could probably use something that changes less often than the git sha such as a hash of all source files, compiler config, package.json etc to decide if to upload the cache from the current build. It is just that so many things could influence the state of the cache I opted for always uploading it regardless.

Then using ${{ runner.os }}-shadow-cljs- (without the sha) as a restore key matches any sha (it is like ${{ runner.os }}-shadow-cljs-* but you don't need the wildcard) so it ensures that the most recently uploaded cache is downloaded on every build.

I've confirmed .shadow-cljs is being correctly uploaded and downloaded on builds. The same configuration works on a couple of private projects significantly reducing build times. Just on the re-frame-10x example it doesn't speed up builds for some reason. I'll need to investigate more to find out if this difference is due to something I am doing or within shadow-cljs itself.

@thheller
Copy link
Owner

It is probably enough to hash everything that affects dependencies only. The cache otherwise will then already check the sha1 of everything else so uploading a cache for each commit seems overkill.

Should be fine with just shadow-cljs.edn or if you are using deps.edn or project.clj plus your package.json.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants