Huge repository - maintaining binaries in git #455

Closed
agherzan opened this issue Aug 10, 2015 · 32 comments

Comments

@agherzan

Hello,

We are using this repository to bring in boot binaries for Raspberry Pi platforms in the Yocto build system. Cloning it has become a real issue, as the repository has already grown to over 4GB (4345.67MB). Generally it doesn't make sense to track binaries in git, because the git history gets huge for no advantage. Is Raspberry Pi able to provide the firmware "releases" as archives? That would save our users a lot of useless download time.

Regards,
Andrei

@agherzan agherzan changed the title from "Huge repository maintaining binaries in git" to "Huge repository - maintaining binaries in git" on Aug 10, 2015
@kraj

kraj commented Aug 10, 2015

It's really not the history but the metadata itself that will get bloated at a higher pace.
I'll second this, since I don't see the point of using git for binary blobs; it's not a tool for them, especially large ones. Every version added ends up stored as a whole blob, unlike sources, which are stored as deltas and compress well. So I just see this repo growing out of proportion over time as more versions are added. Fetching 4GB is not for feeble connections either.

If this were released as discrete tarballs instead, it would be much easier to consume.

@agherzan
Author

Agreed: metadata indeed.

@kraj

kraj commented Aug 11, 2015

Yeah, I think that would work as long as the repo is not rebased. I have some reports where exporting as archives caused problems on OpenEmbedded, but I have to dig up the email. It certainly did not mention the root cause of the issue, though.

@MilhouseVH

This issue seems to come up often, have you tried the suggestion here?

@kraj

kraj commented Aug 11, 2015

@MilhouseVH this won't work out of the box with OpenEmbedded: support for shallow clones is not there in the fetchers, and it would probably also introduce a dependency on a certain version of git.

@XECDesign
Contributor

In the future, stable releases will be tagged, which I think will also put them on the releases page here. Does that solve the problem?

@MilhouseVH

> @MilhouseVH this won't work out of the box with OpenEmbedded: support for shallow clones is not there in the fetchers, and it would probably also introduce a dependency on a certain version of git.

So maybe OpenEmbedded needs fixing? ;)

@ghost

ghost commented Aug 11, 2015

Would this solve the problem? https://git-lfs.github.com/

@agherzan
Author

@MilhouseVH Implementing shallow clones is something that can be done in oe-core, but again it's not feasible, as it depends on the git version. So even if we supported it in oe-core, we wouldn't be able to use it as a general solution for users.

@XECDesign The problem that @kraj was talking about is that GitHub can and will regenerate these tarballs whenever it feels like it, so the archives' checksums will change over time. Given that oe-core uses checksums to validate remote fetches, we would need to update these checksums periodically, which would be a pain.
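
To make the checksum concern concrete, here is a minimal sketch of the kind of check a build system performs on a fetched archive. The function name `fetch_verified` and its arguments are hypothetical and this is not OpenEmbedded's actual fetcher; it assumes `curl` and `sha256sum` are installed.

```shell
# Hypothetical helper (not OpenEmbedded's fetcher): download a file and
# compare it against a pinned SHA-256. Assumes curl and sha256sum exist.
fetch_verified() {
  local url="$1" expected="$2" out="$3" actual
  curl -fsSL -o "$out" "$url" || return 1
  actual=$(sha256sum "$out" | cut -d' ' -f1)
  if [ "$actual" != "$expected" ]; then
    # A regenerated archive with identical contents still fails here,
    # because the check covers the archive bytes, not the files inside.
    echo "checksum mismatch for $out" >&2
    rm -f -- "$out"
    return 1
  fi
}
```

With a check like this, any byte-level change in a server-generated tarball breaks the fetch until the pinned value is updated by hand.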

@popcornmix
Contributor

We don't rebase the firmware repo. I'm not sure why archives' checksums would change over time.

@MilhouseVH

> @MilhouseVH Implementing shallow clones is something that can be done in oe-core, but again it's not feasible, as it depends on the git version.

git clone --depth has been supported for at least 6 years... perhaps if a user is running such an outdated version of git that doesn't support shallow clones they should just accept the hit (in which case they're no worse off than they are today), but at least users with more recent versions of git can benefit. This would seem a more pragmatic solution than trying to force someone else to reorg their repo for your benefit.
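
As a rough sketch of that "shallow if possible, otherwise take the hit" approach: the helper name `clone_firmware` is made up for illustration, and only `git clone` and its `--depth` option are real.

```shell
# Hypothetical helper: prefer a shallow clone, fall back to a full one.
clone_firmware() {
  local url="$1" dest="$2"
  if git clone --depth=1 "$url" "$dest" 2>/dev/null; then
    echo "shallow"
  elif git clone "$url" "$dest"; then
    # Reached when the shallow attempt fails, e.g. with a git too old
    # to know --depth; the user is then no worse off than today.
    echo "full"
  else
    return 1
  fi
}
# Usage: clone_firmware https://github.com/raspberrypi/firmware firmware
```

The function prints which kind of clone succeeded, so a caller can log when the fallback was needed.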

@kraj

kraj commented Aug 11, 2015

@MilhouseVH irrespective of anything, this repo will grow to a large size at an accelerated pace, as it has huge binaries as part of it, so you have to document ways to clone this repo in ways that git is not generally used. GitHub does support SVN; maybe SVN would be a better choice given the nature of this repo.

@nbd168

nbd168 commented Dec 16, 2015

I think the only sane solution is if somebody creates a filtered and automatically maintained fork of this repo which only includes the important stuff like the firmware binaries, leaving out all this other nonsense that is filling the repository with insane amounts of bloat.

Unfortunately it has to be a fork, because upstream has shown that they don't give a crap about automated build systems relying on the firmware.

OpenWrt is currently using this repo: https://github.com/Hexxeh/rpi-firmware/, but even that one (though more harmless than upstream) is filled with quite a bit of useless garbage.

@Ferroin

Ferroin commented Dec 16, 2015

That still doesn't solve the real issue, which is that a VCS designed for source code is being used for binary files larger than a few kilobytes that change on a regular basis. Upstream should be doing proper releases instead of using a VCS for storing files which shouldn't be patched. Eventually, this repository will get big enough that GitHub itself will start having issues with it, and then we're all screwed.

@agherzan
Author

@Ferroin Very good point.

@net147

net147 commented Mar 6, 2016

Any update on this? There is a releases section but the archives are dynamically generated by the server and the checksum may change in the future. A git clone is now over 5 GB. Ideally the automatically generated source archive on the releases page should be downloaded and then uploaded back to GitHub when a release is tagged (a manually uploaded release will show the file size next to the download link - see example at https://github.com/DynamicDevices/bbexample/releases). The checksum of manually uploaded archives will not change.

@cleverca22

NixOS handles the changing hashes by hashing the contents of the tar, not the tar itself (it unpacks, sorts the files by name, then hashes them in order).
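
The idea can be sketched in a few lines of shell. This is a simplified illustration of content hashing, not Nix's actual algorithm (Nix uses its own archive serialization); the name `content_hash` is hypothetical, and the sketch ignores file permissions and breaks on file names containing newlines.

```shell
# Hypothetical helper: hash what is inside a tarball rather than the tarball.
# Simplified: ignores permissions and assumes no newlines in file names.
content_hash() {
  local tmp hash
  tmp=$(mktemp -d) || return 1
  if ! tar -xf "$1" -C "$tmp"; then
    rm -rf -- "$tmp"
    return 1
  fi
  # List files in a fixed byte order, hash each one, then hash the listing.
  hash=$(cd "$tmp" && find . -type f | LC_ALL=C sort |
    while IFS= read -r f; do sha256sum "$f"; done |
    sha256sum | cut -d' ' -f1)
  rm -rf -- "$tmp"
  echo "$hash"
}
```

Two archives that differ only in member order, timestamps, or compression settings produce the same value, while any change to a file's bytes or path changes it.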

@agherzan
Author

agherzan commented Mar 7, 2016

Would it be a solution for the RPi foundation to not keep these releases in git, but drop them on a server and serve them over HTTP? I still don't get why keeping these binaries under a revision control system was chosen.

@Ferroin

Ferroin commented Mar 8, 2016

Almost anything would be better than having them under traditional VCS. The point of VCS is so you can see the changes between versions (so the sources should be under VCS), but what is needed here is just to have a reliable static tag for identifying a particular version. Ideally, such a tag should be a lot more human friendly than a git hash (a192a05 means nothing unless you have the git repo, and even then it still means nothing for most people).

@lurch
Contributor

lurch commented Mar 8, 2016

> Ideally, such a tag should be a lot more human friendly than a git hash

There have been over 600 commits to this repo, so I'm not sure that any versioning scheme that supported that many revisions would be any more 'human friendly' than a git hash?

@Ferroin

Ferroin commented Mar 8, 2016

An ISO date plus a tag to differentiate testing versions and multiple releases on the same day would be far more human friendly, and would also be machine parseable without needing git.

Think something like:
2014-01-30
And for a second release that day:
2015-07-03.r2
Or for a pre-release test version for that day:
2016-11-12.p1

It obviously doesn't have to be exactly like that, but it's not hard to come up with more human friendly alternatives to git hashes for versioning. Even a simple monotonically increasing revision number is more human friendly than a truncated SHA1 hash.
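
To illustrate the "machine parseable without needing git" part, a tag in this style can be checked with nothing more than `grep`. The function name `classify_tag` and the exact suffix rules are assumptions based on the three examples above, and the sketch does not validate month or day ranges.

```shell
# Hypothetical tag format taken from the examples above:
# YYYY-MM-DD, optionally followed by .rN (same-day release) or .pN (pre-release).
classify_tag() {
  local date_re='[0-9]{4}-[0-9]{2}-[0-9]{2}'
  if printf '%s\n' "$1" | grep -Eq "^${date_re}\$"; then
    echo "release"
  elif printf '%s\n' "$1" | grep -Eq "^${date_re}\.r[0-9]+\$"; then
    echo "same-day-release"
  elif printf '%s\n' "$1" | grep -Eq "^${date_re}\.p[0-9]+\$"; then
    echo "pre-release"
  else
    echo "invalid"
    return 1
  fi
}
```

For example, `classify_tag 2015-07-03.r2` prints `same-day-release`, while a truncated hash such as `a192a05` is rejected as `invalid`.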

@nbd168

nbd168 commented Mar 8, 2016

All this discussion about different version control systems and their ability to handle binaries is pretty pointless. The main issue is still that a few small files which many projects need (and for which this repo is the main upstream) are mixed in the same repo with a truckload of garbage.

The important binaries are small and don't change very often. Even if they did, the repository size wouldn't be a problem if they were separate.

@Ferroin

Ferroin commented Mar 8, 2016

Unless some alternative can be determined, nothing is going to change. My statement from December of last year still stands, though: this will blow up eventually; it would just take longer to do so if things get split out like you propose (splitting things out is not a bad idea, it's just not a complete solution).

@lurch
Contributor

lurch commented Mar 8, 2016

If someone has the hosting space (and bandwidth), I guess they could use git archive to set up their own 'stable' archives, named and tagged however they want. I did a bit of experimentation last night, and git is easily scriptable, e.g. you could do something like

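#!/bin/bash
# Archive one subdirectory of each recent commit as a tarball with checksums.
# Requires bash (for [[ ]] and export -f) and GNU parallel.
REPO=firmware
APATH=boot
export REPO APATH

archive_commit() {
  local COMMIT_ID="$1"
  local FILENAME="${REPO}-${APATH}-${COMMIT_ID}.tar.gz"
  # Skip commits that have already been archived.
  if [[ ! -e "../$FILENAME" ]]; then
    echo "Archiving $APATH from $REPO commit $COMMIT_ID"
    git archive -o "../$FILENAME" "$COMMIT_ID" "$APATH"
    # Write the checksum files next to the archive; the subshell keeps
    # the working directory of the caller unchanged.
    (
      cd .. || exit 1
      for CKSUM in md5 sha1 sha256; do
        "${CKSUM}sum" -b "$FILENAME" > "$FILENAME.$CKSUM"
      done
    )
  fi
}
export -f archive_commit

# Only the last 10 commits are fetched, so only those get archived.
git clone --depth=10 "https://github.com/raspberrypi/$REPO"
cd "$REPO" || exit 1
git log --format="format:%H" | parallel -u archive_commit {}
cd ..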

@Ferroin

Ferroin commented Mar 8, 2016

They still need to get away from Git for the firmware files. The firmware repo takes more than half an hour to clone fully on my laptop on a good network (2.5MiB/s average), and takes almost 2 minutes to pull a single commit. It takes up a total of 5.2G, with 5G of that being git history, and consists of 628 commits (based on git log --all). This means 96% of it is git history, and each commit takes up roughly 8M on average. All of this is significantly above average given the number of commits. I'm actually somewhat amazed that it's still manageable to the degree it is.

By comparison, my local copy of the upstream linux-stable repository takes about 15 minutes for a full clone on the same system and about 30 seconds for --depth=1. It takes up 1.7G of space in total, with 991M of that being git history, and consists of 11 469 075 commits. This means that about 58% of it is git history, and each commit takes 90 bytes on average. It has over 11 million more commits than the firmware repo, and is less than half of the size, because source code and text files are what Git is supposed to be used for.
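
For anyone who wants to redo this arithmetic on their own clone, here is a small sketch. The helper name `history_stats` is hypothetical, sizes are passed in MiB with 1G treated as 1000M for simplicity, and the results are truncated by integer math.

```shell
# Hypothetical helper: share of a clone taken up by git history, and the
# average history size per commit in bytes. Sizes in MiB, integer math only.
history_stats() {
  local total_mib="$1" history_mib="$2" commits="$3" pct per_commit
  pct=$(( 100 * history_mib / total_mib ))
  per_commit=$(( history_mib * 1048576 / commits ))
  echo "${pct}% ${per_commit}"
}
# Inputs can be measured in a clone with:
#   git count-objects -vH        (size of the object store)
#   git rev-list --all --count   (number of commits on all refs)
history_stats 5200 5000 628        # firmware:     96% 8348535
history_stats 1700 991 11469075    # linux-stable: 58% 90
```

With the figures quoted above, the firmware repo works out to about 8M of history per commit against roughly 90 bytes for linux-stable.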

@lurch
Contributor

lurch commented Mar 9, 2016

Interesting statistics. Just out of curiosity, any idea why a local clone of https://github.com/raspberrypi/linux.git takes up 2.2G of diskspace for 605717 commits?

I did a git gc in my clone of the firmware repo and it knocked it down to 5.0G. And (again, out of curiosity) I did git gc --aggressive and it took absolutely ages, and chewed through 7GB of RAM (!), but knocked it down a bit further to 4.0GB of diskspace.

@Ferroin

Ferroin commented Mar 9, 2016

@lurch I'm not entirely certain, although it's worth noting the commit numbers I gave were calculated using this:
git log --oneline --all | wc -l
I get significantly smaller numbers if I don't pass --all to git (around 530k for the repo I gave for comparison, as opposed to the 11m I get with --all).

Also, for what it's worth, that kind of RAM usage is somewhat typical for git gc --aggressive, part of what it's doing is re-compressing the git history data, which is why it has such an impact on size (although the firmware repo is mostly incompressible data, which is part of why it's so big to begin with).

@ChristopherUNIS

As @wm4 suggested, Git LFS (https://git-lfs.github.com/) would definitely help here; it is made to handle binaries within a git repo.

@agherzan
Author

agherzan commented Feb 12, 2022

It looks like this issue can be closed as won't fix. I'll close it for now but feel free to resurrect it if still planned in the future.

@lurch
Contributor

lurch commented Feb 12, 2022

If you're happy with only using stable versions, rather than always being able to access the latest bleeding-edge version, I guess another option would be downloading one of the .orig tarballs from https://archive.raspberrypi.com/debian/pool/main/r/raspberrypi-firmware/ ?

@agherzan
Author

That is a good idea. I think this would be the best for integration with the build system. I'll try it and see how it goes. Thanks.
