Feature request/idea: dry-run mode #17

tfnico · 2013-04-24T09:38:54Z

In order to preview which files would be removed with -b, I first used a Perl script to see which files would be deleted. However, it would seem practical if BFG could be run in a dry-run mode to see what the output would be, without actually making any changes in the repo.

Of course, it's also easy to just make another clone to do the test-run first. But if it's easy to implement dry-run, why not.

rtyley · 2013-04-24T14:39:19Z

Hi there @tfnico - hmm, my reply to this got surprisingly long, which is weird given how simple the feature sounds (I guess this is probably an indication of how obsessive I am about this stuff at the moment).

Ok, to define the feature story:

As a user, I'd like to be able to get useful feedback on what the BFG would do if executed with the supplied
settings, but without the BFG actually changing the state of the repo ...so that I feel more confident about
experimenting with the BFG, and (ideally) enjoy a faster feedback loop than having to delete the result of my
experiment and re-grab a copy of the original repo every time I try something out.

It's possible to do a small imperfect chunk of this without any problem - if we're talking specifically about the -b switch (ie --strip-blobs-bigger-than) then we can already scan the Git packfile in advance to work out the hash-ids of objects bigger than that limit, and display them to the user - but that only gives you the hash ids, not the file names or file paths, which is not very friendly.
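A rough way to get that kind of preview with plain Git, including paths as well as hash ids, can be sketched as below. This is only an approximation under stated assumptions: it must be run inside the repository, the function name `list_big_blobs` is made up for illustration, and it knows nothing about the BFG's own rules (for example, the BFG protects the files in your latest commit by default), so it is not a faithful dry run of `-b`.

```shell
# List blobs reachable from any ref whose size exceeds a byte threshold,
# largest first, as: <size-in-bytes> TAB <blob-id> TAB <path>.
# Usage: list_big_blobs <min-bytes>   (hypothetical helper, run inside the repo)
list_big_blobs() {
  local min_bytes="$1"
  # rev-list prints "<id> <path>"; cat-file passes the path through as %(rest).
  git rev-list --objects --all |
    git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
    awk -v min="$min_bytes" '
      $1 == "blob" && $3 + 0 > min + 0 {
        path = $0
        sub(/^[^ ]+ [^ ]+ [^ ]+ /, "", path)   # keep paths containing spaces intact
        print $3 "\t" $2 "\t" path
      }' |
    sort -rn
}
```

Assuming `1M` means 1,048,576 bytes, `list_big_blobs 1048576` would list the candidates for `-b 1M`. A blob that appears under several paths is reported once, with whichever path Git walked first.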

Once we're talking about evaluating the results of other operations, ie the --replace-text flag, we're basically saying we have to run The BFG for real, but we don't want it to update refs (easy) or indeed write cleaned objects to the repo object database (gives me pause... it's possible, so long as none of the cleaners ever want to examine the contents of previously cleaned objects... should be ok, I think). So we're probably going to end up with an execution run-time that is very very close to just running the BFG for real, but it does at least mean the user doesn't have to wipe and re-copy their Git repo for every experiment.

Perhaps surprisingly, given the identical runtime, this does mean the user ends up in the position that in some ways they have less diagnostic information than if they'd just run the BFG for real, because now they can't actually examine the cleaned commits... which means the diagnostic output from the BFG needs to be beefed up - although I'm quite proud of the diagnostic output that The BFG does supply, it could still do with improvement, ie some variant of the stuff in #14 (display diffs of changed content) and #15 (log detailed diagnostics to file) to make --dry-run genuinely useful.

tfnico · 2013-04-24T18:38:06Z

@rtyley Sounds great. It would certainly be cool to output diffs for replaced texts, and lists of files that have been deleted.

I think rewriting without changing the refs sounds like a fair compromise. Performance is smooth anyhow.

alistra · 2014-02-26T13:51:49Z

I second that feature, I'm just afraid to use it on a big company repo, without the dry run option

rtyley · 2014-02-26T14:02:59Z

> I second that feature, I'm just afraid to use it on a big company repo, without the dry run option

Hi @alistra - just so I can better understand the use-case, is there any reason you can't just do a git clone --mirror on the repo, and run The BFG on that local copy?

alistra · 2014-02-26T14:05:50Z

it's harder to see the changes, I would have to manually browse around 6 branches (that are around 2 year old), would be nicer just to go through the changes list with branch/file pairs and check if we wouldn't delete something important accidentally.

Not all of the code in the repo is used all the time, so the mistake wouldn't be obvious right away.

rtyley · 2014-02-26T14:21:12Z

> I would have to manually browse around 6 branches (that are around 2 year old), would be nicer just to go through the changes list with branch/file pairs and check if we wouldn't delete something important accidentally.

Would you want to check every single commit on those branches (which potentially could be a lot of very repetitive information) or would you be interested in just the tips of those branches, ie the latest commit on each branch?

alistra · 2014-02-26T14:28:37Z

Ideal solution would deduplicate the same files and tell me:

would remove file dir/dir2/foo [56 commits]
would remove file dir/bar [22 commits, 3 branches]

Then, in order of usefulness, I would like checking the tips of branches, then the whole big dump of data
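For the "tips of branches" case, a list of branch/file pairs can be approximated with plain Git. The sketch below is an assumption-laden illustration, not anything the BFG provides: the function name `big_files_at_tips` is hypothetical, it only looks at local branches, and it only covers the size-based (`-b`) case.

```shell
# For the tip of every local branch, print files larger than a byte threshold
# as: <branch> TAB <size-in-bytes> TAB <path>.
# Usage: big_files_at_tips <min-bytes>   (hypothetical helper, run inside the repo)
big_files_at_tips() {
  local min_bytes="$1" branch
  git for-each-ref --format='%(refname:short)' refs/heads |
  while read -r branch; do
    # ls-tree -l prints "<mode> <type> <id> <size>" then a TAB and the path.
    git ls-tree -r -l "$branch" |
      awk -v min="$min_bytes" -v b="$branch" '
        $2 == "blob" && $4 + 0 > min + 0 {
          print b "\t" $4 "\t" substr($0, index($0, "\t") + 1)
        }'
  done
}
```

Running `big_files_at_tips 1048576` would show, per branch, which files over roughly 1 MB are still present at that branch's tip.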

arturhoo · 2014-03-20T14:07:56Z

I don't necessarily need a --dry-run option, but it would be handy to have a list of the file names and paths that were removed after running bfg.

I am also afraid of accidentally deleting a big file that could be useful in the future (although not present in my most recent commit).

danijar · 2014-03-28T10:25:01Z

Dry run would be very useful in my opinion, too.

dandv · 2014-05-16T07:39:58Z

👍 for dry runs. I'm an intermediate git user, and what I'd like is to:

1. Simulate the bfg operation
2. Check out the repo as it would be after pushing it and cloning.

That way I can compare the bfg'ed repo with one of known quality and ensure files are fine.

This should make seeing what the BFG has got up to a lot easier, and will make a dry-run mode (still not implemented) much more useful. #26 #17 (comment)

The new output looks like this (and it only appears if files *have* been changed, or deleted, as appropriate).

```
Changed files
-------------

File                           Before     After
--------------------------------------------------
bushhidthefacts-ORIGINAL.txt | 93fd267a | 4f6f1558

Deleted files
-------------

File        git-id     Size (bytes)
-----------------------------------
video.mp4 | 294f4016 | 126384
```

This should make seeing what the BFG has got up to a lot easier, and will make a dry-run mode (still not implemented) much more useful. #26 #17 (comment)

The new output looks like this (and it only appears if files *have* been changed, or deleted, as appropriate).

```
Changed files
-------------

Filename    Before & After
--------------------------------------------------------------------------------------------------
CODE.conf | e3aa4a56 ⇒ b9241055, 6fd90c18 ⇒ 1f390cd7, ...
PROD.conf | 5a89032a ⇒ 38193000, 2611394d ⇒ 9c742f65, ...

Deleted files
-------------

Filename                Git id
------------------------------------------------
bg.jpg                | d0ea4091 (2.0 MB)
guardian_space002.png | 24215b1e (1.2 MB)
```

winny- · 2015-06-11T15:20:54Z

👍 I literally downloaded bfg and looked for a --dry-run flag before actually deleting the blobs, not to find it.

lwcolton · 2015-10-13T17:03:23Z

+1

javabrett · 2015-10-13T23:20:50Z

As alluded to in comments above, I think this request decomposes into:

1. Getting a simple objects-to-delete report, of what bfg can find and plans to do, perhaps with some nice metrics thrown in. This is similar to the request "Find big files only" (#26) for a report-only feature, except it targets large-blob identification only.
2. Doing something a little more than that, which borders on not really being a "dry run". Is the ask to actually do some rewriting, but not update refs? @rtyley describes this in his response.

Also as commented above, I'm trying to understand the advantage of a --dry-run over simply making a super-cheap local clone of the target repository, and running bfg on that as the dry-run. It's better than a dry-run, since you get a risk-free look at what bfg will actually output, with a detailed report, rather than a simple report of intent.

Since (on Linux anyway) git clone will by default create hard-links when you clone a local repo, that step is super-fast even for massive repos, but you can then run bfg on that clone as if it were independent of the original. Please note however that git clone will not give you an independent on-disk backup unless you specify the --no-hardlinks option to prevent hard-links from being created between your two local repos' object stores.
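To see whether a given clone shares object files with another repo, one option is to count the object-store files whose hard-link count is above 1. This is a minimal sketch; the helper name `count_shared_object_files` and the directory name `linux-bfg-backup` are made up for illustration.

```shell
# Print how many files in a repository's object store have a hard-link
# count greater than 1, i.e. are shared with some other directory.
# Usage: count_shared_object_files <repo-dir>   (hypothetical helper)
count_shared_object_files() {
  (
    cd "$1" || exit 1
    find "$(git rev-parse --git-path objects)" -type f -links +1 | wc -l | tr -d ' '
  )
}

# Example:
#   git clone linux linux-bfg-test-run               # object files are hard-linked
#   git clone --no-hardlinks linux linux-bfg-backup  # object files are copied
#   count_shared_object_files linux-bfg-backup       # should report no shared files
```

A count above zero in the test clone is harmless for a trial run, since Git writes new object files rather than modifying existing ones, but it does mean the clone is not a standalone backup.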

Say I pick a decent-size repo, the Linux Kernel, and run an academic clean-1M+ on it and time it. First I'll create a hard-linked local clone:

$ time git clone linux linux-bfg-test-run
Cloning into 'linux-bfg-test-run'...
done.
Checking out files: 100% (51567/51567), done.

real    0m4.562s
user    0m3.438s
sys 0m1.120s

... then run bfg:

~/git/linux-bfg-test-run(master) $ time java -jar ~/Downloads/bfg-1.12.5.jar -b 1M
...
Cleaning
--------

Found 547745 commits
Cleaning commits:       100% (547745/547745)
Cleaning commits completed in 268,733 ms.
...
Updating 288 Refs
-----------------
real    5m13.075s
user    10m37.739s
sys 1m47.214s

5 seconds for the clone, 5m13s for the bfg-run, total 5m18s.

Check the original repo and it is untouched. Check the object-store in linux-bfg-test-run and note that lots of objects have been unpacked due to the bfg rewrite. Run the recommended git reflog expire --expire=now --all && git gc --prune=now --aggressive followed by a git repack -a -d and note that new packs have been created, and the hard-link count on the original repo's packs has dropped from 2 to 1.

Compare this to a plain run on the repo:

~/git/linux $ time java -jar ~/Downloads/bfg-1.12.5.jar -b 1M
...
Cleaning
--------

Found 547745 commits
Cleaning commits:       100% (547745/547745)
Cleaning commits completed in 258,862 ms.

Updating 288 Refs
-----------------
...
real    5m1.818s
user    10m23.116s
sys 1m41.897s

About the same, 5m and change. So for the low-cost (5 seconds) of a local clone, you can do a real-test-run rather than a dry-run/report. Of course Windows users, not having hardlinks available, would have to wear the extra time and space cost of the initial clone. Also of course, you will have to pay for the disk-space usage as bfg writes to the test-repo.

So it feels like a native --dry-run option would only be attractive if in being a dry-run, a lot of bfg execution-time could be saved, versus the benefit of getting a real look at the output. I know that I would much prefer to see/test/inspect a real output before I run this for real, rather than a dry-run report.

Zitrax · 2016-03-15T10:10:44Z

I was also looking for a dry-run. But when not finding it I hoped the real run would print out some info, but I only saw a list of updated refs, it would be very helpful if it actually printed exactly what files/folders were deleted in addition. ( I was using --delete-folders )

Tails · 2017-04-15T14:00:00Z

My repo is huge and it would be nice to not have to make a copy of it. Very scared to run without a dry run!

javabrett · 2017-04-18T00:53:37Z

See also my comment above, but consider the following:

1. In almost all cases, there will be some shared and remote repo that will eventually be overwritten with the BFG-shrunk, rewritten repo. You will almost never want to run BFG on that remote - you can't run on a remote, and it's unlikely you'll want to remote-into that server and run BFG.
2. The BFG instructions suggest you make a mirror-clone of the remote as the first step.
3. Git makes it incredibly cheap to make additional, local clones of that clone, where you can perform full test-runs of BFG. Filesystems with hard-link support will make cheaper and faster local clones. Ergo there is a cheap mechanism for performing a full BFG run at very little cost, which can be discarded and repeated at-will.
4. Git's storage mechanism, and BFG's cleaning of the Git database, means that you really won't get a perfect feel for the outcome and eventual repo shrinkage without performing an actual run, rather than a do-nothing dry-run. Since shrinkage is often one of the main metrics, and creating a test-clone is so cheap, it is worth doing a "real" dry-run.
5. You won't replace the "real" remote repo, which is isolated, until you are happy with what you achieve locally.

That is, test local clones are incredibly cheap, it is necessary to run BFG to really see what it will achieve, ergo it is better to actually run it, and there is dubious value in a dry-run mechanism.
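The test-clone workflow described above could be wrapped up roughly as follows. This is a sketch under assumptions, not part of the BFG: `trial_run` is a hypothetical helper that mirror-clones a local repo into a temporary directory and runs whatever command it is given inside that clone, leaving the source repo alone.

```shell
# Run a command (normally the BFG) inside a throwaway mirror clone of a local
# repo, then print the path of that clone on stdout for later inspection.
# Usage: trial_run <source-repo> <command> [args...]   (hypothetical helper)
trial_run() {
  local src="$1" trial
  shift
  trial="$(mktemp -d)/trial.git" || return 1
  git clone --quiet --mirror "$src" "$trial" || return 1
  # The command's own output goes to stderr so stdout carries only the path.
  ( cd "$trial" && "$@" ) >&2 || return 1
  printf '%s\n' "$trial"
}

# Example, reusing the jar path and options from the timings earlier in the thread:
#   trial=$(trial_run ~/git/linux java -jar ~/Downloads/bfg-1.12.5.jar -b 1M)
#   git -C "$trial" log --oneline -5    # inspect the rewritten history
#   rm -rf "$trial"                     # discard and repeat at will
```

Because the clone is local, Git can hard-link the object files where the filesystem allows it, which is what keeps the setup step cheap.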

ghost · 2017-08-28T15:23:05Z

> That is, test local clones are incredibly cheap...

I like this feature idea because sometimes ^^^ is not as true as we'd like -- I'm trying to remove junk data from a repo that is 3.9 GB when checked out. (Full disclosure: I didn't do it! I'm trying to fix it : )

> ...dubious value in a dry-run mechanism

I admit that even though this feature would be nice, it definitely falls under the "nice to have" category.


lovesegfault · 2019-10-08T22:10:17Z

I'd like to reinforce the need for a dry-run mode. I'm cleaning up a massive repo and it's painful to have to clone it twice to use bfg

javabrett · 2019-10-08T22:23:36Z

@lovesegfault so we can put numbers to this ... what are the timings if you a) first clone remote to local, then b) make a second clone, local to local, perhaps allowing hard links?

lovesegfault · 2019-10-09T01:08:02Z

@javabrett First clone takes a good 30mins

fabb · 2019-10-09T05:14:28Z

Maybe try git clone --reference for the second clone and run bfg on this second one for test? (I.e. don't run bfg on the first one so you can make a reference clone from it again later)
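As a rough sketch of that suggestion (the helper name `reference_clone` and the example paths and URL are placeholders, not anything from this thread):

```shell
# Make a second clone of <url> that borrows objects from an existing local
# clone instead of fetching them again.
# Usage: reference_clone <existing-local-clone> <url> <new-dir>   (hypothetical helper)
reference_clone() {
  git clone --quiet --reference "$1" "$2" "$3"
}

# Example (placeholder names):
#   reference_clone ~/git/first-clone https://example.com/big-repo.git ~/git/bfg-test
#   cd ~/git/bfg-test && java -jar ~/Downloads/bfg-1.12.5.jar -b 1M
```

One caveat: a reference clone depends on the first clone's object store, so deleting or pruning the first clone can break the second. Adding Git's `--dissociate` option to the clone copies the borrowed objects and removes that dependency, at the cost of extra disk space.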

builder-main · 2021-07-10T15:36:05Z

Well, when working on tens of gigs repos (like Unity/Unreal) you'd be happy to have a --dry-run saving lots of time.
Meanwhile we'll try the --reference option.

rtyley mentioned this issue Dec 26, 2014

Report deleted & changed files at the end of the run #63

Closed
