Huge repository - maintaining binaries in git #455
It's really not the history but the metadata itself that will get bloated at a higher pace. If this were instead released as discrete tarballs, it would be much easier to consume.
Agreed: metadata indeed.
Yeah, I think that would work as long as the repo is not rebased. I have some reports that exporting as archives caused problems on OpenEmbedded, but I have to dig up the email. It certainly did not mention the root cause of the issue, though.
This issue seems to come up often; have you tried the suggestion here?
@MilhouseVH this won't work out of the box with OpenEmbedded; support for shallow clones is not there in the fetchers, and it would probably also introduce a dependency on a particular version of git.
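For readers unfamiliar with the suggestion, a shallow clone amounts to roughly the following (a sketch; the function name is invented for illustration, and, as noted above, `--depth` support varies with git version):

```shell
# Sketch of the shallow-clone approach discussed above. --depth fetches only
# the most recent commits, avoiding the multi-gigabyte history download.
shallow_clone() {
  URL="$1"   # e.g. https://github.com/raspberrypi/firmware
  DIR="$2"   # destination directory
  git clone --depth=1 "$URL" "$DIR"
}
```

The resulting clone contains only one commit's worth of history, which is the entire appeal here, and also the reason it cannot serve as a general-purpose fetcher backend when old git versions must be supported.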
In the future, stable releases will be tagged, which I think will also put them on the releases page here. Does that solve the problem?
So maybe OpenEmbedded needs fixing? ;)
Would this solve the problem? https://git-lfs.github.com/
@MilhouseVH Implementing shallow clones is something that can be done in oe-core, but again it's not feasible, as it depends on the git version. So even if we supported it in oe-core, we wouldn't be able to use it as a general solution for users. @XECDesign The problem @kraj was talking about is that GitHub can and will regenerate these tarballs whenever it feels like it, so the archives' checksums will change over time. Given that oe-core uses checksums to validate remote fetches, we would need to periodically update these checksums, which would be a pain.
We don't rebase the firmware repo. I'm not sure why the archives' checksums would change over time.
@MilhouseVH Irrespective of anything else, this repo will grow to a large size at an accelerated pace, since it has huge binaries in it, so you will have to document ways to clone this repo that differ from how git is generally used. GitHub does support SVN; maybe SVN would be a better choice for the nature of this repo.
I think the only sane solution is for somebody to create a filtered and automatically maintained fork of this repo which includes only the important stuff, like the firmware binaries, leaving out all the other nonsense that is filling the repository with insane amounts of bloat. Unfortunately it has to be a fork, because upstream has shown that they don't give a crap about automated build systems relying on the firmware. OpenWrt is currently using this repo: https://github.com/Hexxeh/rpi-firmware/, but even that one (though more harmless than upstream) is filled with quite a bit of useless garbage.
That still doesn't solve the real issue, which is that a VCS designed for source code is being used for binary files larger than a few kilobytes which change on a regular basis. Upstream should be doing proper releases instead of using a VCS for storing files which shouldn't be patched. Eventually, this repository will get big enough that GitHub itself will start having issues with it, and then we're all screwed.
@Ferroin Very good point.
Any update on this? There is a releases section, but the archives are dynamically generated by the server and the checksums may change in the future. A git clone is now over 5 GB. Ideally, the automatically generated source archive on the releases page should be downloaded and then uploaded back to GitHub when a release is tagged (a manually uploaded release shows the file size next to the download link; see the example at https://github.com/DynamicDevices/bbexample/releases). The checksum of a manually uploaded archive will not change.
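To make the "checksums change" problem concrete: build recipes pin a checksum and fail the fetch on any mismatch. A minimal sketch of that validation step (the function name is invented for illustration; real fetchers like oe-core's have more machinery around this):

```shell
# Verify a downloaded archive against a pinned sha256, the way build systems
# validate remote fetches. Returns non-zero on mismatch, failing the build.
verify_archive() {
  FILE="$1"
  EXPECTED_SHA256="$2"
  ACTUAL=$(sha256sum "$FILE" | cut -d' ' -f1)
  [ "$ACTUAL" = "$EXPECTED_SHA256" ]
}
```

The failure mode described above is that a regenerated tarball can differ byte-for-byte (compression level, gzip timestamp) even when its contents are identical, so `ACTUAL` changes and every pinned recipe breaks. Manually uploaded release assets are static files, so their bytes never change.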
nixos handles the hashes changing by hashing the contents of the tar, not the tar itself (it unpacks, sorts the files by name, then hashes them in order).
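A sketch of that idea (this is an assumption about the approach described above, not nix's actual implementation, and it ignores edge cases like filenames with whitespace): unpack, walk the files in a stable sorted order, and hash names plus contents, so the result is independent of tar/gzip metadata.

```shell
# Hash the contents of an unpacked tree rather than the archive bytes.
# LC_ALL=C sort gives a stable order; sha256sum emits "hash  ./name" lines,
# so both contents and names feed the final hash, while timestamps,
# permissions, and compression settings do not.
content_hash() {
  DIR="$1"
  (cd "$DIR" && find . -type f | LC_ALL=C sort | xargs sha256sum) | sha256sum | cut -d' ' -f1
}
```

With this, two tarballs that GitHub generated at different times, with different gzip metadata, still produce the same hash as long as the files inside are identical.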
Would it be a solution for the RPi foundation to not keep these releases under git, but instead drop them on a local server and serve them over HTTP? I still don't get why putting these binaries under a revision control system was chosen in the first place.
Almost anything would be better than having them under a traditional VCS. The point of a VCS is that you can see the changes between versions (so the sources should be under VCS), but what is needed here is just a reliable static tag for identifying a particular version. Ideally, such a tag should be a lot more human friendly than a git hash (a192a05 means nothing unless you have the git repo, and even then it still means nothing for most people).
There have been over 600 commits to this repo, so I'm not sure that any versioning scheme that supported that many revisions would be any more 'human friendly' than a git hash?
An ISO date plus a tag to differentiate testing versions and multiple releases on the same day would be exponentially more human friendly, and would also be machine parseable without needing git. Think something like: It obviously doesn't have to be exactly like that, but it's not hard to come up with more human friendly alternatives to git hashes for versioning. Even a simple monotonically increasing revision number is more human friendly than a truncated SHA1 hash.
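For illustration only, here is one hypothetical shape such a tag could take and how trivially it parses without git. The scheme, names, and example tags below are invented for this sketch; the original comment's own example did not survive extraction.

```shell
# Parse a hypothetical tag of the form YYYY-MM-DD-<channel>.<n>,
# e.g. 2016-05-12-stable.1: an ISO date, a release channel, and a counter
# for multiple releases on the same day. Prints "date channel n".
parse_tag() {
  TAG="$1"
  DATE="${TAG%-*}"    # strip the trailing "-<channel>.<n>" -> 2016-05-12
  REST="${TAG##*-}"   # keep only "<channel>.<n>"           -> stable.1
  printf '%s %s %s\n' "$DATE" "${REST%.*}" "${REST#*.}"
}
```

Such tags sort chronologically with a plain lexical sort, which is exactly the machine-parseability-without-git property argued for above.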
All this discussion about different version control systems and their ability to handle binaries is pretty pointless. The main issue is still that a few small files which many projects need (and for which this repo is the main upstream) are mixed in the same repo with a truckload of garbage. The important binaries are small and don't change very often. Even if they did, the repository size wouldn't be a problem if they were separate.
Unless some alternative can be determined, nothing is going to change. My statement from December of last year still stands, though: this will blow up eventually; it would just take longer to do so if things get split out as you propose (splitting things out is not a bad idea, it's just not a complete solution).
If someone has the hosting space (and bandwidth) I guess they could use `git archive` to set up their 'stable' archives named whatever they want, with files named/tagged however they want? I did a bit of experimentation last night:

```shell
REPO=firmware
export REPO
APATH=boot
export APATH

# Archive one subdirectory of one commit and emit checksum files alongside
# it, skipping archives that already exist.
archive_commit() {
  COMMIT_ID="$1"
  FILENAME="${REPO}-${APATH}-${COMMIT_ID}.tar.gz"
  if [[ ! -e "../$FILENAME" ]]; then
    echo "Archiving $APATH from $REPO commit $COMMIT_ID"
    git archive -o "../$FILENAME" "$COMMIT_ID" "$APATH"
    cd ..
    for CKSUM in md5 sha1 sha256; do
      "${CKSUM}sum" -b "$FILENAME" > "$FILENAME.$CKSUM"
    done
    cd "$REPO"
  fi
}
export -f archive_commit  # export -f is a bashism, needed so parallel sees it

git clone --depth=10 "https://github.com/raspberrypi/$REPO"
cd "$REPO"
git log --format="format:%H" | parallel -u archive_commit {}
cd ..
```
They still need to get away from Git for the firmware files. The firmware repo takes more than half an hour to clone fully on my laptop on a good network (2.5MiB/s average), and almost 2 minutes to pull a single commit. It takes up a total of 5.2G, with 5G of that being git history, and consists of 628 commits (based on

By comparison, my local copy of the upstream linux-stable repository takes about 15 minutes for a full clone on the same system, about 30 seconds with --depth=1, takes up 1.7G of space total, with 991M of that being git history, and consists of 11 469 075 commits. This means that about 58% of it is git history, and each commit takes on average 90 bytes. It has tens of millions more commits than the firmware repo, and is less than half the size, because source code and text files are what Git is supposed to be used for.
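The exact commands behind the numbers above were lost from the comment, but statistics like these can be gathered with standard git plumbing (a sketch; this is an assumption about the measurement, not a reconstruction of the commenter's command):

```shell
# Count reachable commits and report the on-disk pack size of a repository.
count_commits() { git -C "$1" rev-list --count HEAD; }
pack_size()     { git -C "$1" count-objects -vH | awk '/^size-pack/ {print $2, $3}'; }
```

Running these against the firmware repo and a kernel repo side by side is the quickest way to reproduce the history-to-content ratio comparison made here.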
Interesting statistics. Just out of curiosity, any idea why a local clone of https://github.com/raspberrypi/linux.git takes up 2.2G of disk space for 605717 commits? I did a
@lurch I'm not entirely certain, although it's worth noting the commit numbers I gave were calculated using this:

Also, for what it's worth, that kind of RAM usage is somewhat typical for
As @wm4 suggested, Git LFS (https://git-lfs.github.com/) would definitely help here; it is made to handle binaries within a git repo.
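For concreteness, LFS tracking is configured via a `.gitattributes` file committed to the repo. A hypothetical set of rules for this repo might look like the following; the path patterns are illustrative guesses at the large firmware blobs, not the repo's actual layout:

```
# .gitattributes — hypothetical Git LFS tracking rules (paths illustrative)
boot/*.elf filter=lfs diff=lfs merge=lfs -text
boot/*.bin filter=lfs diff=lfs merge=lfs -text
boot/*.dat filter=lfs diff=lfs merge=lfs -text
boot/kernel*.img filter=lfs diff=lfs merge=lfs -text
```

One caveat: adding these rules only affects new commits. The existing multi-gigabyte pack would stay in every clone unless history were rewritten (e.g. with `git lfs migrate`), and as noted earlier in the thread, the maintainers don't rebase this repo.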
It looks like this issue can be closed as
If you're happy with only using stable versions, rather than always being able to access the latest bleeding-edge version, I guess another option would be downloading one of the
That is a good idea. I think this would be the best for integration with the build system. I'll try it and see how it goes. Thanks.
Hello,
We are using this repository to bring in boot binaries for Raspberry Pi platforms in the Yocto build system. Cloning it has started to become a real issue, as this repository is already over 4 GB (4345.67 MB). Generally it doesn't make sense to track binaries under git, because the git history grows huge for no advantage. Is Raspberry Pi able to provide the firmware "releases" as archives? That would save a lot of useless download time for our users.
Regards,
Andrei