Files from protected commits lose their history, show up as if in last commit only #53

Open
vorburger opened this Issue Jul 24, 2014 · 5 comments

@vorburger

Hello @rtyley, first of all, once again thanks for this amazing tool. Here's feedback on something I'm struggling with: unless I misunderstand, files from protected commits lose their history and show up as if in the last commit only? Apologies if this terminology isn't 100% accurate; here's what I mean:

The use case is purging old unused "big" (mostly binary) files from an originally big (4 GB-ish) repo resulting from a git svn clone import from Subversion. So I do something like: java -jar ../bin/bfg*.jar --private -b 512K . - works great, super fast.

As there are some files >512k on HEAD, and because "BFG assumes that your latest commit is a good one, with none of the dirty files you want removing from your history still in it." (great, tx), I obviously get some:

Scanning packfile for large blobs: 387045
Scanning packfile for large blobs completed in 2,230 ms.
Found 1089 blob ids for large blobs - biggest=653983912 smallest=262726
Total size (unpacked)=5219004150
Found 24785 objects to protect
Found 3 commit-pointing refs : HEAD, refs/heads/master, refs/remotes/git-svn

Protected commits

These are your protected commits, and so their contents will NOT be altered:

  • commit 41c3b0f9 (protected by 'HEAD') - contains 116 dirty files :
    • badaboum (479.7 KB)

What's... "sub-optimal" is that e.g. the badaboum file in the repo now appears to have been (deleted first and then) created in the last commit - its history appears to have been lost! :( I'm sure there is a good technical reason for this in the current implementation - but is there any way to fix / improve this, or any advice / trick / workaround you may have? To illustrate:

git show | grep folder/badaboum
diff --git a/folder/badaboum b/folder/badaboum
+++ b/folder/badaboum
diff --git a/folder/badaboum.REMOVED.git-id b/folder/badaboum.REMOVED.git-id
--- a/folder/badaboum.REMOVED.git-id
diff --git a/folder/badaboum_template b/folder/badaboum_template
+++ b/folder/badaboum_template
diff --git a/folder/badaboum_template.REMOVED.git-id b/folder/badaboum_template.REMOVED.git-id
--- a/folder/badaboum_template.REMOVED.git-id

Ideally, I would have hoped that files like badaboum just... stay wherever they are in the history. Possible?

@rtyley
Owner
rtyley commented Jul 24, 2014

So, to summarise your issue:

  • You're running the BFG to remove big files from your repository.
  • There are some big files in your HEAD commit (ie 'protected') and so the BFG is not removing those files from that commit.
  • However, those files are also in some previous commits. The BFG is removing the files from those older commits, and you'd prefer for it to not do that - you'd like the history of those files to remain intact.

This kind of question has come up before - eg in #49 (comment) and the answer is a little subtle:

  • Git really doesn't track files, it tracks content. So when the BFG 'protects a file' in your HEAD commit, it's definitely not protecting all versions of that file - to make that happen would be difficult and CPU-intensive, because Git does not model a direct link between the different versions of that file.
  • Beyond that, even if the content of the file never changes, the BFG may remove it from older commits. This is because the BFG actually performs memoization at the level of trees and commits, but not at the level of blobs - and the protection operates at that level too. So you're not protecting files (like y.txt) when you protect a commit - you're protecting folders. If a folder changes in any way (ie a different file changes), that is enough to remove the protection from earlier versions of that folder.
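
A minimal sketch of that last point, using a throwaway repository. The layout and the file name other.txt are made up for illustration, and this only shows how Git identifies blobs and trees, not the BFG's internals: the big file keeps the same blob id across two commits, but the folder containing it gets a new tree id as soon as a sibling file changes.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Throwaway repo: folder/ holds an unchanged "big" file plus a small file
# that changes between the two commits (names are illustrative only).
repo=$(mktemp -d)
git -C "$repo" init -q
git -C "$repo" config user.email "demo@example.com"
git -C "$repo" config user.name "Demo"

mkdir "$repo/folder"
echo "big content" > "$repo/folder/badaboum"
echo "v1" > "$repo/folder/other.txt"
git -C "$repo" add -A
git -C "$repo" commit -q -m "first"

echo "v2" > "$repo/folder/other.txt"
git -C "$repo" add -A
git -C "$repo" commit -q -m "second"

# The unchanged file is the same blob in both commits...
blob_old=$(git -C "$repo" rev-parse "HEAD~1:folder/badaboum")
blob_new=$(git -C "$repo" rev-parse "HEAD:folder/badaboum")
# ...but the folder that contains it is a different tree.
tree_old=$(git -C "$repo" rev-parse "HEAD~1:folder")
tree_new=$(git -C "$repo" rev-parse "HEAD:folder")

echo "blob: $blob_old -> $blob_new"
echo "tree: $tree_old -> $tree_new"
```

So a protection that is keyed on the tree id of HEAD's folder does not carry over to the older version of that folder, even though the file inside it is byte-for-byte identical.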

I hope that explanation makes sense. It's slightly more nuanced than I wanted to put onto the main documentation page.

@suniala
suniala commented Jul 25, 2014

I had the same problem as @vorburger and the solution I came up with was that I produced a list of blob ids I wanted to remove (about 10,000 of them in the end) and asked BFG to remove said blobs. This approach worked but I would not actually recommend it as it requires a respectable amount of scripting and manual labour. I have discussed this approach previously on #51.

@vorburger

@rtyley tx for your answer, I think I (kind of) "get it" now. @suniala tx for chiming in, very useful & good to know I'm clearly not the only one hitting this Q; we may consider the option of using --strip-blobs-with-ids instead of --strip-blobs-bigger-than (depending on the effort it would be for us to create the "magic shell/git scripts" to produce such a list CORRECTLY.. hm) - or we'll just accept and live with this during our SVN to Git migration.

@pauldraper

I too was misled by the documentation:

If something questionable - like a 10MB file, when you're telling The BFG to strip out everything over 5MB - is in a protected commit, it won't be removed, and because it's still there, there's no point deleting it from earlier commits either. If you want the BFG to delete something you need to make sure your current commits are clean.

I misread "there's no point" as "there's no point and so it won't do it".

I understand the implementation details may preclude this behavior, but I would have expected a file from the protected tree to be kept in earlier commits.

I understand that can be a little fuzzy. In other words, git log --follow my-file would have the same history after running BFG (except for changed SHA-1s).
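
One way to check that expectation, as a rough sketch: compare a file's history in a copy of the repository taken before the rewrite with the rewritten one, using only fields that do not depend on commit SHA-1s. The helper names file_history and same_history and the two repository paths are hypothetical.

```shell
# Print a file's history as "author-timestamp subject" lines. These fields
# do not depend on commit SHA-1s, so they should survive a rewrite unchanged.
file_history() {
    local repo=$1 path=$2
    git -C "$repo" log --follow --format='%at %s' -- "$path"
}

# Succeed only if the file has the same history in both repositories.
same_history() {
    local before=$1 after=$2 path=$3
    diff <(file_history "$before" "$path") <(file_history "$after" "$path") > /dev/null
}
```

Usage would look like same_history repo-before.git repo-after.git folder/my-file && echo "history unchanged". If it fails, diffing the two file_history outputs directly shows which commits dropped out.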

@pauldraper

@rtyley, this doesn't exactly match my earlier suggestion, but this is close.

This determines the ids of large blobs, except for blobs present in HEAD:

(This uses bash and unix utilities. The max size is specified by 1024 * 1024.)

comm -23 \
    <(git rev-list --objects --all | git cat-file  --batch-check="%(objecttype) %(objectname) %(objectsize) %(rest)" | grep ^blob | awk '$3 > 1024 * 1024 { print $2 }' | sort) \
    <(git ls-tree -r HEAD | cut -f 1 | cut -d ' ' -f 3 | sort) \
    > /tmp/large-blobs.list
java -jar bfg-1.12.0.jar -bi /tmp/large-blobs.list

I list all blobs, filter to those more than 1MB, subtract the blobs on HEAD, and output the ids to large-blobs.list. Then I use BFG to remove those blobs.
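
Since a wrong list removes the wrong content, it may be worth reviewing it before the BFG step. The following is only a sketch: the helper names describe_blobs and check_not_on_head are made up, and both are meant to be run from inside the repository with the list file produced above.

```shell
# Show "<blob id> <size> <path>" for every id in the list, largest first,
# so the list can be reviewed before it is handed to the BFG.
describe_blobs() {
    local list=$1
    git rev-list --objects --all \
        | git cat-file --batch-check='%(objectname) %(objectsize) %(rest)' \
        | grep -F -f "$list" \
        | sort -k2,2 -n -r
}

# Fail if any id in the list is still present in HEAD's tree.
check_not_on_head() {
    local list=$1
    ! git ls-tree -r HEAD | cut -f 1 | cut -d ' ' -f 3 | grep -q -F -x -f "$list"
}
```

For example: check_not_on_head /tmp/large-blobs.list && describe_blobs /tmp/large-blobs.list | head -20. The path shown for each blob is the one git rev-list reports for it, so a blob that lived under several names appears under only one of them.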

@javabrett added a commit to javabrett/bfg-repo-cleaner that referenced this issue May 17, 2016
Retain protected blobs (dirt) wherever they occur in history trees, not just in protected refs' trees. Fixed #49, #53, #138.

Added objectId exclusion filters during tree blob-cleaning, such that blobs that exist in the trees of protected refs as stored in the census (AKA dirt) are protected not only in those trees, but in any other trees in which they occur in the walked history.  This prevents the perception that those files are being deleted in-history and then re-added in the final re-written commit, which is the protected HEAD (with dirt) and its untouched tree.  This is more a perception because Git does not track the history of individual files, but it does show diffs and logs that indicate such changes, and the behaviour prior to this change is to remove protected blobs from non-protected history trees, retaining them only in the final HEAD ref, which then shows as an add in that commit when logs/diffs are taken.

This change is convenient if your clean-up selectors (by name, or size) do select some dirt (files still in HEADs that you want to keep) but you don't want those files to appear as if they were recently added to HEAD.
8db54ba
@javabrett added a commit to javabrett/bfg-repo-cleaner that referenced this issue May 17, 2016
2a9387d