jgit-spark-connector is a library for running scalable data retrieval pipelines that process any number of Git repositories for source code analysis.
It is written in Scala and built on top of Apache Spark, which enables rapid construction of custom analysis pipelines that process large numbers of Git repositories stored in HDFS in the Siva file format. It is accessible via both the Scala and Python Spark APIs and can run on large-scale distributed clusters.
The current implementation combines:
- src-d/enry to detect programming language of every file
- bblfsh/client-scala to parse every file to UAST
- src-d/siva-java for reading Siva files in JVM
- apache/spark to extend DataFrame API
- eclipse/jgit for working with Git .pack files
jgit-spark-connector has been deprecated in favor of gitbase-spark-connector and there will be no further development of this tool.
First, you need to download Apache Spark somewhere on your machine:
$ cd /tmp && wget "https://www.apache.org/dyn/mirrors/mirrors.cgi?action=download&filename=spark/spark-2.2.1/spark-2.2.1-bin-hadoop2.7.tgz" -O spark-2.2.1-bin-hadoop2.7.tgz
The Apache Software Foundation suggests the best mirror from which to download Spark. If you want to look for the best option in your case, you can do it here.
Then extract Spark from the downloaded tar file:
$ tar -C ~/ -xvzf spark-2.2.1-bin-hadoop2.7.tgz
The binaries and scripts to run Spark are located in spark-2.2.1-bin-hadoop2.7/bin, so you should set SPARK_HOME to the extracted Spark directory and add its bin directory to PATH. It's advisable to add this to your shell profile:
$ export SPARK_HOME=$HOME/spark-2.2.1-bin-hadoop2.7
$ export PATH=$PATH:$SPARK_HOME/bin
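Before launching the shells, it can help to confirm that the environment matches the setup described above. The following is a minimal Python sketch and is not part of jgit-spark-connector; the function name `check_spark_env` is hypothetical, and it only inspects the SPARK_HOME and PATH variables mentioned above.

```python
import os


def check_spark_env(env=None):
    """Return a list of problems found in the Spark-related environment variables."""
    env = os.environ if env is None else env
    problems = []

    spark_home = env.get("SPARK_HOME")
    if not spark_home:
        problems.append("SPARK_HOME is not set")
        return problems

    # The Spark launch scripts live in $SPARK_HOME/bin.
    bin_dir = os.path.join(spark_home, "bin")
    if not os.path.isdir(bin_dir):
        problems.append("%s does not exist" % bin_dir)

    # PATH entries are separated by os.pathsep (":" on UNIX-like systems).
    path_entries = env.get("PATH", "").split(os.pathsep)
    normalized = [os.path.normpath(p) for p in path_entries if p]
    if os.path.normpath(bin_dir) not in normalized:
        problems.append("%s is not on PATH" % bin_dir)

    return problems
```

Calling `check_spark_env()` with no arguments checks the current process environment; an empty list means both variables look consistent.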
Look up the latest jgit-spark-connector version and substitute it for [version] in the commands below:
$ spark-shell --packages "tech.sourced:jgit-spark-connector:[version]"
# or
$ pyspark --packages "tech.sourced:jgit-spark-connector:[version]"
Run the bblfsh daemon. You can start it easily in a container by following its quick start guide.
If you run jgit-spark-connector in a UNIX-like environment, you should set the LANG variable properly:
export LANG="en_US.UTF-8"
The rationale behind this is that UNIX file systems don't store the encoding of each file name; names are just plain bytes, so the Java file system API looks at the LANG environment variable to decide which encoding to apply.
If LANG is not set to a UTF-8 encoding, or is not set at all (in which case the C locale is used), you may get an exception during jgit-spark-connector execution similar to java.nio.file.InvalidPathException: Malformed input or input contains unmappable characters.
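The LANG requirement can be checked before starting a job. This is a minimal sketch under stated assumptions: `locale_is_utf8` is a hypothetical helper, not part of jgit-spark-connector, and it only looks at LANG, as the text above does. On POSIX systems LC_ALL and LC_CTYPE take precedence over LANG, and this sketch ignores them.

```python
import os


def locale_is_utf8(env=None):
    """Return True if LANG names a UTF-8 locale, e.g. "en_US.UTF-8"."""
    env = os.environ if env is None else env
    lang = env.get("LANG", "")
    # A locale name looks like language_TERRITORY.codeset@modifier;
    # without a "." there is no codeset, as with the plain "C" locale.
    if "." not in lang:
        return False
    codeset = lang.split(".", 1)[1].split("@", 1)[0]
    # Accept the common spellings "UTF-8", "utf8" and "UTF8".
    return codeset.lower().replace("-", "") == "utf8"
```

A launcher script could call `locale_is_utf8()` and print a warning when it returns False, instead of waiting for the InvalidPathException to appear at run time.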
- Scala 2.11.x
- Apache Spark Installation 2.2.x or 2.3.x
- bblfsh >= 2.5.0: Used for UAST extraction
- Python >= 3.4.x (jgit-spark-connector is tested with Python 3.4, 3.5 and 3.6 and these are the supported versions, even if it might still work with previous ones)
- libxml2-dev installed
- python3-dev installed
- g++ installed
jgit-spark-connector is available on Maven Central and can be added to your project as a dependency.
For projects managed by Maven, add the following to your pom.xml:
<dependency>
<groupId>tech.sourced</groupId>
<artifactId>jgit-spark-connector</artifactId>
<version>[version]</version>
</dependency>
For sbt-managed projects, add the dependency:
libraryDependencies += "tech.sourced" % "jgit-spark-connector" % "[version]"
In both cases, replace [version] with the latest jgit-spark-connector version.
The default published jar is a fatjar containing all the dependencies required by jgit-spark-connector. It's meant to be used directly as a jar or through --packages for Spark usage.
If you want to use it in an application and build a fatjar of your own, use what we call the "slim" jar instead:
With maven:
<dependency>
<groupId>tech.sourced</groupId>
<artifactId>jgit-spark-connector</artifactId>
<version>[version]</version>
<classifier>slim</classifier>
</dependency>
Or (for sbt):
libraryDependencies += "tech.sourced" % "jgit-spark-connector" % "[version]" % Compile classifier "slim"
If you run into problems with io.netty.versions.properties on sbt, you can add the following snippet to solve them:
assemblyMergeStrategy in assembly := {
case "META-INF/io.netty.versions.properties" => MergeStrategy.last
case x =>
val oldStrategy = (assemblyMergeStrategy in assembly).value
oldStrategy(x)
}
Installing the Python wrappers is necessary to use jgit-spark-connector from pyspark:
$ pip install sourced-jgit-spark-connector
Then provide jgit-spark-connector's Maven coordinates to the pyspark shell:
$ $SPARK_HOME/bin/pyspark --packages "tech.sourced:jgit-spark-connector:[version]"
Replace [version] with the latest jgit-spark-connector version.
Install jgit-spark-connector wrappers as in local mode:
$ pip install -e sourced-jgit-spark-connector
Then package and compress the Python wrappers with zip so they can be provided to pyspark. This is required to distribute the code among the nodes of the cluster.
$ zip -r ./sourced-jgit-spark-connector.zip <path-to-installed-package>
$ $SPARK_HOME/bin/pyspark <same-args-as-local-plus> --py-files ./sourced-jgit-spark-connector.zip
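The packaging step can also be done from Python. The following is a minimal sketch using only the standard library; `zip_package` is a hypothetical helper name, and the sketch assumes the installed wrappers live in a single package directory (the examples in this document import from a package named `sourced`). The package folder is kept as the top-level entry of the archive so that it stays importable when the archive is passed to --py-files.

```python
import os
import zipfile


def zip_package(package_dir, archive_path):
    """Zip package_dir so that the package folder is the archive's top-level entry."""
    package_dir = os.path.abspath(package_dir)
    parent = os.path.dirname(package_dir)
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for root, _dirs, files in os.walk(package_dir):
            for name in files:
                if name.endswith(".pyc"):
                    continue  # skip compiled bytecode; the .py sources are enough
                full_path = os.path.join(root, name)
                # Store paths relative to the parent, e.g. "sourced/engine/__init__.py".
                archive.write(full_path, os.path.relpath(full_path, parent))
    return archive_path
```

For example, `zip_package("<path-to-installed-package>", "./sourced-jgit-spark-connector.zip")` would produce the archive used in the pyspark command above.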
Run pyspark as explained before to start using jgit-spark-connector, replacing [version] with the latest jgit-spark-connector version:
$ $SPARK_HOME/bin/pyspark --packages "tech.sourced:jgit-spark-connector:[version]"
Welcome to
spark version 2.2.1
Using Python version 3.6.2 (default, Jul 20 2017 03:52:27)
SparkSession available as 'spark'.
>>> from sourced.engine import Engine
>>> engine = Engine(spark, '/path/to/siva/files', 'siva')
>>> engine.repositories.filter('id = "github.com/mingrammer/funmath.git"').references.filter("name = 'refs/heads/HEAD'").show()
+--------------------+---------------+--------------------+
|       repository_id|           name|                hash|
+--------------------+---------------+--------------------+
|github.com/mingra...|refs/heads/HEAD|290440b64a73f5c7e...|
+--------------------+---------------+--------------------+
You must provide jgit-spark-connector as a dependency as follows, replacing [version] with the latest jgit-spark-connector version:
$ spark-shell --packages "tech.sourced:jgit-spark-connector:[version]"
To start using jgit-spark-connector from the shell, import everything inside the tech.sourced.engine package (or, if you prefer, just import the Engine and EngineDataFrame classes):
scala> import tech.sourced.engine._
import tech.sourced.engine._
Now create an instance of Engine, giving it the Spark session and the path of the directory containing the siva files:
scala> val engine = Engine(spark, "/path/to/siva-files", "siva")
Then, you will be able to perform queries over the repositories:
scala> engine.getRepositories.filter('id === "github.com/mawag/faq-xiyoulinux").
| getReferences.filter('name === "refs/heads/HEAD").
| getAllReferenceCommits.filter('message.contains("Initial")).
| select('repository_id, 'hash, 'message).
| show
+--------------------+--------------------+--------------------+
|       repository_id|                hash|             message|
+--------------------+--------------------+--------------------+
|github.com/mawag/...|fff7062de8474d10a...|      Initial commit|
+--------------------+--------------------+--------------------+
As you might have seen, you need to provide the repository format you will be reading when you create the Engine instance. Although the documentation always uses the siva format, there are more repository formats available.
These are all the supported formats at the moment:
- siva: rooted repositories packed in a single .siva file.
- standard: regular git repositories with a .git folder. Each in a folder of their own under the given repository path.
- bare: git bare repositories. Each in a folder of their own under the given repository path.
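To decide which format string to pass as the third argument of Engine, one can inspect what is on disk. The sketch below is illustrative only: `guess_repository_format` is a hypothetical helper, not something the connector provides, and the check for bare repositories (a HEAD file plus objects and refs directories at the top level) is an assumption about the usual git layout rather than the connector's own detection logic.

```python
import os


def guess_repository_format(path):
    """Guess which Engine format string matches a single entry on disk.

    Returns "siva", "standard", "bare", or None when nothing matches.
    """
    if os.path.isfile(path) and path.endswith(".siva"):
        return "siva"
    if os.path.isdir(path):
        # A regular clone keeps its git data in a .git entry.
        if os.path.exists(os.path.join(path, ".git")):
            return "standard"
        # Assumption: a bare repository has HEAD, objects/ and refs/ at its top level.
        has_head = os.path.isfile(os.path.join(path, "HEAD"))
        has_objects = os.path.isdir(os.path.join(path, "objects"))
        has_refs = os.path.isdir(os.path.join(path, "refs"))
        if has_head and has_objects and has_refs:
            return "bare"
    return None
```

Since the standard and bare formats expect each repository in its own folder under the given path, the helper would be applied to each child of that path, not to the path itself.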
There are some design decisions that may surprise the user when processing local repositories, instead of siva files. This is the list of things you should take into account when doing so:
- All local branches will belong to a repository whose id is file://$REPOSITORY_PATH. So, if you clone https://github.com/foo/bar.git at /home/foo/bar, you will see two repositories, file:///home/foo/bar and github.com/foo/bar, even if you only have one.
- Remote branches are transformed from refs/remote/$REMOTE_NAME/$BRANCH_NAME to refs/heads/$BRANCH_NAME as they will only belong to the repository id of their corresponding remote. So refs/remote/origin/HEAD becomes refs/heads/HEAD.
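The two rules above can be illustrated with a short Python sketch. This is not the connector's implementation; the function names are hypothetical, and the refs/remote/ prefix is taken verbatim from the description above.

```python
def local_repository_id(repository_path):
    """Build the id given to the repository that owns the local branches."""
    return "file://" + repository_path


def remote_ref_to_head(ref_name):
    """Map refs/remote/$REMOTE_NAME/$BRANCH_NAME to ($REMOTE_NAME, refs/heads/$BRANCH_NAME).

    Returns None for references that do not follow that pattern.
    """
    prefix = "refs/remote/"
    if not ref_name.startswith(prefix):
        return None
    # The first path component is the remote name; the rest is the branch name.
    remote_name, sep, branch_name = ref_name[len(prefix):].partition("/")
    if not sep or not branch_name:
        return None
    return remote_name, "refs/heads/" + branch_name
```

With the clone from the example, `local_repository_id("/home/foo/bar")` gives `file:///home/foo/bar`, and `remote_ref_to_head("refs/remote/origin/HEAD")` gives `("origin", "refs/heads/HEAD")`.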
You can launch our Docker container, which contains some example notebooks, by running:
docker run --name jgit-spark-connector-jupyter --rm -it -p 8080:8080 -v $(pwd)/path/to/siva-files:/repositories --link bblfshd:bblfshd srcd/jgit-spark-connector-jupyter
You must have some siva files locally to mount into the container, replacing the path $(pwd)/path/to/siva-files. You can get some siva files from the project here.
You should have a bblfsh daemon container running to link to the jupyter container (see Pre-requisites).
When the jgit-spark-connector-jupyter container starts, it will show you a URL that you can open in your browser.
If you are using jgit-spark-connector directly from Python and are unable to modify PYSPARK_SUBMIT_ARGS, you can copy the jgit-spark-connector jar into the pyspark jars directory to make it available there.
cp jgit-spark-connector.jar "$(python -c 'import pyspark; print(pyspark.__path__[0])')/jars"
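The same copy can be done from Python, which may be convenient in environments without shell access. This is a minimal sketch; `install_connector_jar` is a hypothetical helper, and `pyspark_root` is assumed to be the pyspark installation directory, which the cp command above obtains from `pyspark.__path__[0]`.

```python
import os
import shutil


def install_connector_jar(jar_path, pyspark_root):
    """Copy jar_path into <pyspark_root>/jars and return the destination path."""
    jars_dir = os.path.join(pyspark_root, "jars")
    if not os.path.isdir(jars_dir):
        raise FileNotFoundError("no jars directory under %s" % pyspark_root)
    # shutil.copy keeps the file name when the destination is a directory.
    return shutil.copy(jar_path, jars_dir)
```

With pyspark installed, a call would look like `install_connector_jar("jgit-spark-connector.jar", pyspark.__path__[0])` after `import pyspark`.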
Once the jar is in place, you can use the connector as follows:
import sys
pyspark_path = "/path/to/pyspark/python"
sys.path.append(pyspark_path)
from pyspark.sql import SparkSession
from sourced.engine import Engine
siva_folder = "/path/to/siva-files"
spark = SparkSession.builder.appName("test").master("local[*]").getOrCreate()
engine = Engine(spark, siva_folder, 'siva')
Building the fatjar is needed to build the Docker image that contains the Jupyter server, or to test changes in spark-shell by passing the jar with the --jars flag:
$ make build
It leaves the fatjar in target/scala-2.11/jgit-spark-connector-uber.jar
To build an image with the latest build of the project:
$ make docker-build
Notebooks under the examples folder will be included in the image.
To run a container with the Jupyter server:
$ make docker-run
Before running the jupyter container you must run a bblfsh daemon:
$ make docker-bblfsh
If it's the first time you run the bblfsh daemon, you must install the drivers:
$ make docker-bblfsh-install-drivers
To see installed drivers:
$ make docker-bblfsh-list-drivers
To remove the development jupyter image generated:
$ make docker-clean
jgit-spark-connector uses bblfsh so you need an instance of a bblfsh server running:
$ make docker-bblfsh
To run tests:
$ make test
To run tests for the Python wrapper:
$ cd python
$ make test
There is no Windows support in enry-java or bblfsh's client-scala right now, so the language detection and UAST features are not available on the Windows platform.
Apache License Version 2.0, see LICENSE