Incremental Map-Reduce on Repository History
This repository uses sbt and contains the following sub-projects:
- Core contains the minimal functionality, the native library and the basic engine implementations.
- Git contains the connector to Git repositories and the corresponding library.
- Spark contains the implementation of the first Spark engine.
- Apps contains first applications cases (in migration).
- Eval contains the test and measurement utils (in migration).
Check out this repository or add the following dependencies to your sbt file (currently just available for scala 2.12).
scalaVersion := "2.12.10"
libraryDependencies += "org.topleet" %% "core" % "0.1.0"
libraryDependencies += "org.topleet" %% "git" % "0.1.0" // Optional Git
libraryDependencies += "org.topleet" %% "spark" % "0.1.0" // Optional SparkThe scala object Sandbox in the Apps project serves as a good template to start with.
The code imports the Natives and Gits library providing all basic methods for processing GitHub repositories.
import org.topleet.libs.Natives._
import org.topleet.libs.Gits._Second, the code initializes a parallel and incremental engine which is used as the underlying processing infrastructure realizing the library calls.
import org.topleet.Engine
import org.topleet.engines.IncrementalParallelEngine
implicit val engine: Engine = IncrementalParallelEngine.create()Hereafter, the actual processing can start with the initialization of
the first data structure of type Leet.
import org.topleet.git.SHA
import org.topleet.Leet
val shas: Leet[SHA, Single[SHA]] = git("jwtk/jjwt")By using the provided git library method and passing a repository owner and
name, a data structure of type Leet[SHA, Single[SHA]] is created which reflects
the acyclic history of the Git repository.
Each commit, which is identified by the first generic SHA, is assigned to its
own SHA given by the second generic Single[SHA].
The content and the graph structure of the data structure can already be
examined by invoking show(), index(), nodes() and edges() on shas.
In most application, such initialization is followed by an extraction
of available resources on every commit.
The Git specific resource access is enable by method resources(),
just provided for data structures of type Leet[SHA, Single[SHA]].
import org.topleet.git.Resource
val resources: Leet[SHA, Bag[(Path, Resource)]] = shas.resources()This data structure now assigns commits to bags
of path-resource tuples Bag[(Path, Resource)]. Such data structure
is a convenient starting point for any further processing
involving filter, join, cartesian, map, flatten, sum or length (count) calls.
The following code counts the number
of lines in java files without single line comments. (In Sandbox, JDT is used to extract the cyclomatic complexity.)
val loc: Leet[SHA, Single[Int]] = resources
.filter { case (path, _) => path.endsWith(".java") }
.map { case (_, resource) => bag(resource.read().split("\n")) }
.flatten()
.filter(line => !line.startsWith("//"))
.length()The final content of the data structure can be accessed using the index() call.
val iterator: Iterator[(SHA, Single[Int])] = loc.index()
for ((sha, metric) <- iterator)
println("sha " + sha + " with values " + metric)A Spark engine (including GraphPartition and NodeAlias optimizations) can be created using the following constructor. A SparkContext and the assumed number of commits need to be handed as parameters.
import org.topleet.engines.SparkEngine
val spark: SparkContext = ...
val ncommits: Int = ...
implicit val engine: Engine = SparkEngine.create(ncommits, spark)The content of the background graph can be visualized using
call .show() on any Leet data structure (produced by the Gits library).
The call generates a html file that is opened by
the default browser. Parts of the behavior can be adapted in the call, e.g., orientation or edge labels.
Zooming and moving is possible.
Showing the initial data structure of the previous sections by shas.show()
results in the following window opening in the browser (Viewer). While hovering over nodes,
meta-data appears, and a link to the commit on GitHub can be navigated.
A data structure might get large, and the recent visualization might hit limitations showing extraordinary large graphs. However, the graph viewer employs a trick called "graph contraction" to only show edges and nodes where change happen.
The following graph shows an evolving cyclomatic complexity metric summed over all files of a repository (Viewer). Nodes that are connected by an edge not changing the metric, are 'merged' into a single node (we refer to this as 'contracted'). This decreases the size of the visualized graph. The annotated meta-data of such nodes is special as it lists all contracted commits. In such list, the first commit (owner) did the actual change, and the other commits just follow without further changes.
Presentation slides can be found here and the corresponding SANER paper can be found here with the following BibTeX entry.
@inproceedings{DBLP:conf/wcre/HartelL20,
author = {Johannes H{\"{a}}rtel and
Ralf L{\"{a}}mmel},
title = "{Incremental Map-Reduce on Repository History}",
booktitle = {{SANER}},
pages = {320--331},
publisher = {{IEEE}},
year = {2020}
}

