This is a utility for filtering a git branch.
What, how, why?
It outputs a new branch with only the things we want included, in all revisions.
It writes a new history, preserving the part of it that is relevant to the new branch.
It grew out of attempts to do the same thing using git filter-branch, and then
directly using git plumbing commands when git filter-branch was found to be too slow.
On my test repository with 100k commits, the original git filter-branch based
filtering took around 2 days to finish one filter set. The same job takes about
a minute with git_filter on the same hardware.
The input is a positive list of files and directories to be included.
The tool can apply several filter sets in a single run. This is the usual case when splitting a large repository into smaller ones: you want two or more disjoint sets of data which together contain the whole original repository.
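As a rough illustration of what a positive list means, the sketch below assigns paths to filter sets in Python. It is not git_filter's actual matching code: the function names, the rule that a directory entry includes everything beneath it, and the sample filter sets are all assumptions made for this example.

```python
def is_included(path, include_list):
    """Return True if path is listed, or lies under a listed directory.

    Assumption: an entry matches the exact path, or any path below it
    when the entry names a directory.
    """
    for entry in include_list:
        entry = entry.rstrip("/")
        if path == entry or path.startswith(entry + "/"):
            return True
    return False


def split_paths(paths, filter_sets):
    """Map each filter set name to the paths its positive list includes."""
    return {
        name: [p for p in paths if is_included(p, includes)]
        for name, includes in filter_sets.items()
    }


# Hypothetical filter sets and paths, for illustration only.
filter_sets = {"docs": ["docs/"], "core": ["src", "Makefile"]}
paths = ["docs/a.md", "src/main.c", "Makefile", "srcx/foo"]
result = split_paths(paths, filter_sets)
```

Here "srcx/foo" ends up in neither set, because a positive list only includes what it names explicitly.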
git_filter leaves a lot of loose objects behind when it finishes,
so it is a very good idea to repack the repository
(git repack -ad)
before continuing to work with the resulting repository.
In addition to the new branches, git_filter outputs a .revinfo text file
per branch, with one line per new revision showing its correspondence to the
original revision it is derived from. The purpose of this is to allow
recreation of tag information.
For me, the purpose of the git_filter program was to generate final
repositories which contain none of the original commits.
To do this I needed to do some further work.
The push_clean_repos script creates a clean repository for each of the
filtered branches generated by git_filter.
Each new repository has the same name as the corresponding branch.
The script takes the same configuration file as its argument as git_filter does.
newtags.py uses the .revinfo files from git_filter and the
tag information in the source repository to map the tags in the source
to each of the destination repositories.
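The tag mapping idea can be sketched in Python as follows. This is not the code in newtags.py. The .revinfo line layout used here (new revision first, original revision second, separated by whitespace) is an assumption, as are the function names and the choice to drop tags whose commit has no counterpart in the filtered branch.

```python
def load_revinfo(lines):
    """Build {original_sha: new_sha} from .revinfo lines.

    ASSUMED layout: '<new_sha> <original_sha>' per line. Check a real
    .revinfo file before relying on this.
    """
    mapping = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or unexpected lines
        new_sha, orig_sha = parts
        mapping[orig_sha] = new_sha
    return mapping


def map_tags(source_tags, mapping):
    """Translate {tag: original_sha} into {tag: new_sha}.

    Tags that point at a revision missing from the mapping are dropped
    (an assumption of this sketch).
    """
    return {
        tag: mapping[sha]
        for tag, sha in source_tags.items()
        if sha in mapping
    }


# Hypothetical, shortened hashes for illustration only.
revinfo_lines = ["n1 o1\n", "n2 o3\n"]
source_tags = {"v1.0": "o1", "v1.1": "o2", "v2.0": "o3"}
new_tags = map_tags(source_tags, load_revinfo(revinfo_lines))
```

In this example the tag "v1.1" is dropped because its original revision "o2" does not appear in the filtered branch.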
I have a git repository, repo, that I want to split up. It is located in the current directory.
./git_filter git_filter.cfg && ./push_clean_repos git_filter.cfg
git_filter saves the necessary state (in the .git directory) to allow
processing of a full history to be resumed without regenerating the
revisions that were already processed.
We can run it once on the entire history and then run it incrementally on
new commits and produce the same result as starting from scratch each time.
This results in much shorter processing times.
Tell git_filter to do this by adding the option continue on the command
line after the configuration file, thus:
./git_filter git_filter.cfg continue
Building the script
Just a plain build should be enough to produce the git_filter binary.
The build automatically downloads libgit2
and builds it as part of the process. It has been tested to compile on
Mac (with Xcode installed) and on Ubuntu Linux.
Neither of these systems had a pre-installed libgit2.
Config File Syntax
See the filter.cfg example; it is commented.
Config items and data
The config file parser is very simple, so a single space is the only allowed
separator. Parameter names must be exactly 4 characters, followed by a colon
and a space. Lines beginning with a # are comment lines and are ignored.
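To make the line format concrete, here is a minimal Python sketch of a parser that follows the rules above. It is not the parser in git_filter. The function name, the skipping of blank lines, and the sample values are assumptions; the REVN range is the one used as an example in this document.

```python
def parse_config(text):
    """Parse 'NAME: value' lines, where NAME is exactly 4 characters.

    Lines starting with '#' are comments. Blank lines are skipped,
    which is an assumption of this sketch.
    """
    items = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line or line.startswith("#"):
            continue
        # Characters 0-3 are the name, then a colon and one space.
        if len(line) < 6 or line[4:6] != ": ":
            raise ValueError(
                f"line {lineno}: expected 'NAME: value', got {line!r}"
            )
        items.append((line[:4], line[6:]))
    return items


# Hypothetical paths and names, for illustration only.
sample = """\
# example configuration
REPO: /path/to/repo
REVN: range master~1000..master
FILT: docs docs.filter
TPFX: split-
"""
config = parse_config(sample)
```

The result is a list of (name, value) pairs rather than a dictionary, so that repeated items such as several FILT lines keep their order.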
The configuration file should contain one REPO tag with the location of the repository to filter.
REVN: [range|ref] <refspec>
A revision specification: either a range, e.g.
master~1000..master, or a (branch) reference.
A base directory for the filter file lists.
FILT: <name> <file>
Space separated name and filter file pair.
TPFX: <tag prefix>
The prefix for tags and output repo names, prepended to the filter set name.