Analyze your git repo for developers with unique knowledge.
Latest commit 286dbe9 Mar 20, 2014 @tomheon Update
Failed to load latest commit information.
.gitignore Update Mar 21, 2014 Initial commit. Jan 5, 2011 This should really be a function. Jan 6, 2011 Initial commit. Jan 5, 2011 Fix svn error that crept in with symlinks. Jan 6, 2011 Output to correct directory regardless of script origination Jan 11, 2011

Git by a Bus: Estimate unique and at-risk knowledge in your source code.


To the folks who seem to be using this lately: there a bunch of pathological things that happen when you run it on a large repo. I started a rewrite that fixes that long ago (in the v2 branch), but it peetered out--however I'll take this surge of interest as an excuse to get off my butt and fix the thing. In the meantime, sorry to those of you who've had python OOM on you!


I originally developed Git by a Bus at HubSpot, who kindly allowed me to open source it.

Motivation and description of the algorithm are here:


If you find any, let me know on github or by email at

Installation and Dependencies

The entire project is written in python, known to work with 2.6.1 and 2.6.5. The python isn't very exotic, so it may work with earlier versions as well. Let me know if you problems or successes with other versions.

Git by a Bus requires a locally executable git (unless you are using svn) and local git repositories of each project you want to analyze.

If you are using the experimental svn support, you must have pysvn installed and available in your PYTHONPATH (to test try running python from the command line and 'import pysvn'. If it works you should be all set). I've tested with pysvn version 1.7.4, but again the usage of pysvn isn't very exotic, so earlier versions might work. You should see the subversion notes below if you really want to use svn.


The driver file is, which you should run with "python ". Run it with the -h flag to get a list of options and more detailed usage. See the partial re-runs section below for tips about cheaper re-runs.


The driver runs the following scripts in sequence:





Each produces a corresponding tsv in the output directory (e.g. produces output/gen_file_stats.tsv), which is used as input to the next step.

The file produces an html summary in output/index.html and output/{devs,projects,files}.

Partial Re-Runs

Sometimes you want to re-run with a different set of bus risks or lists of departed devs. Or maybe you want to exclude some files you forgot to mark as uninteresting and crunch again. Given that the step is by far the slowest, you might want to try a little file manipulation to speed up the process.

If you run with the -c flag, will only try to regenerate missing files. So for example you could adjust the contents of output/gen_file_stats.tsv, remove all the other tsvs in output, and run the command again with an added -c flag, in which case will notice the existing gen_file_stats.tsv and not attempt to re-generate it. See the output section above for a full list of outputs.

Subversion Notes

Git by a Bus has experimental support for svn. It uses svn urls instead of local repositories, and it's orders of magnitude slower than the git version. For example, a moderately sized local repository here at HubSpot takes about 2 minutes to anazlyze using git, and about 90 minutes to analyze using svn. It also puts a fair amount of strain on the svn server, as it has to run a full log for each file and a diff between each sequential version of each file.

If you have an svn repository you want to analyze, I suggest using "git svn" to convert the svn repository to git and then analyzing the git repository, since it will be much faster on repeated runs. The git svn bridge puts a fair amount of load on the svn server during conversion as well, so for large and busy repos you might want to consider making a local copy with svnadmin hotcopy and then converting to git from the copy.